Posted to solr-user@lucene.apache.org by dhaivat dave <dh...@gmail.com> on 2013/08/12 15:29:05 UTC

developing custom tokenizer

Hello All,

I want to create a custom tokeniser in Solr 4.4. It would be very helpful if
someone could share any tutorials or information on this.


Many Thanks,
Dhaivat Dave

Re: developing custom tokenizer

Posted by dhaivat dave <dh...@gmail.com>.
Hi Alex,

Thanks for your reply. I looked into the core analysers and created a
custom tokeniser based on them; I have shared the code below. When I try it
in Solr's analysis screen, the analyser works fine, but when I submit
100 docs together I can see in the logs (via custom message printing)
that for some of the documents the "create" method of
SampleTokeniserFactory (please see the code below) is never called.

Can you please help me figure out what's wrong in the following code? Am I
missing something?
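For context, Lucene 4.x analyzers reuse tokenizer instances per thread, so a factory's create() may run only once per thread; for later documents the same instance just receives a new reader plus reset(). The sketch below is a pure-Java simulation of that lifecycle (the FakeTokenizer class is hypothetical, not a Lucene class), showing why work done only in the constructor goes stale on reuse:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Tokenizer, to illustrate Lucene's
// per-thread reuse: "create" runs once, later documents only trigger
// setReader() + reset() on the same instance.
class FakeTokenizer {
    private final List<String> tokens = new ArrayList<String>();
    private String text;

    FakeTokenizer(String text) {
        this.text = text;
        buildTokens();          // work done only in the constructor...
    }

    void setReader(String text) {
        this.text = text;       // ...is NOT redone when the input changes
    }

    void reset() {
        // a reuse-safe tokenizer would rebuild its per-document state here
    }

    private void buildTokens() {
        tokens.clear();
        for (String w : text.split(" ")) {
            tokens.add(w);
        }
    }

    List<String> tokens() {
        return tokens;
    }
}

class ReuseDemo {
    public static void main(String[] args) {
        FakeTokenizer t = new FakeTokenizer("first doc"); // "create" called once
        t.reset();
        System.out.println(t.tokens()); // tokens of "first doc"

        t.setReader("second doc here"); // instance reused for doc 2
        t.reset();
        System.out.println(t.tokens()); // still the tokens of "first doc"!
    }
}
```

Running this prints the tokens of "first doc" twice: the second document never gets tokenized, which mirrors a constructor-only tokenizer being reused.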

Here is the class that extends TokenizerFactory:

=== SampleTokeniserFactory.java

package ns.solr.analyser;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class SampleTokeniserFactory extends TokenizerFactory {

    public SampleTokeniserFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public SampleTokeniser create(AttributeFactory factory, Reader reader) {
        return new SampleTokeniser(factory, reader);
    }
}
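For completeness, once the classes are packed into a jar on Solr's classpath, the factory is wired into a field type in schema.xml. A minimal sketch; the field type name text_sample is my own placeholder, not from the original post:

```xml
<fieldType name="text_sample" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- fully qualified factory class from the code above -->
    <tokenizer class="ns.solr.analyser.SampleTokeniserFactory"/>
  </analyzer>
</fieldType>
```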

Here is the class that extends Tokenizer:

=== SampleTokeniser.java

package ns.solr.analyser;

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class SampleTokeniser extends Tokenizer {

    private final List<Token> tokenList = new ArrayList<Token>();

    private int tokenCounter = -1;

    /** Term text attribute. */
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    /** Offset attribute. */
    private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);

    /** Position increment attribute. */
    private final PositionIncrementAttribute position =
            addAttribute(PositionIncrementAttribute.class);

    public SampleTokeniser(AttributeFactory factory, Reader reader) {
        super(factory, reader);
        try {
            // the whole reader is consumed up front, in the constructor
            processText(readFully(reader));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String readFully(Reader reader) throws IOException {
        char[] arr = new char[8 * 1024]; // 8K at a time
        StringBuilder buf = new StringBuilder();
        int numChars;
        while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
            buf.append(arr, 0, numChars);
        }
        return buf.toString();
    }

    public void processText(String textToProcess) {
        String[] wordsList = textToProcess.split(" ");
        int startOffset = 0;
        for (String word : wordsList) {
            int endOffset = startOffset + word.length();
            Token aToken = new Token("Token." + word, startOffset, endOffset);
            aToken.setPositionIncrement(1);
            tokenList.add(aToken);
            startOffset = endOffset + 1; // skip the separating space
        }
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        tokenCounter++;
        if (tokenCounter < tokenList.size()) {
            Token aToken = tokenList.get(tokenCounter);
            termAtt.append(aToken);
            termAtt.setLength(aToken.length());
            offsetAttribute.setOffset(correctOffset(aToken.startOffset()),
                    correctOffset(aToken.endOffset()));
            position.setPositionIncrement(aToken.getPositionIncrement());
            return true;
        }
        return false;
    }

    @Override
    public void close() throws IOException {
        super.close();
        System.out.println("Close method called");
    }

    @Override
    public void end() throws IOException {
        super.end();
        // would normally set the final offset here
        System.out.println("end called with final offset");
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        System.out.println("Reset Called");
        tokenCounter = -1;
    }
}
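As an aside, the intended offset bookkeeping for a single-space split can be checked in isolation. This standalone sketch (plain Java, no Lucene classes, the OffsetDemo class is my own) advances startOffset past each word plus the separating space, so every [startOffset, endOffset) pair indexes back into the original text:

```java
import java.util.ArrayList;
import java.util.List;

class OffsetDemo {
    // For each word of a single-space-separated string, record its
    // [startOffset, endOffset) span into the original string.
    static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<int[]>();
        int startOffset = 0;
        for (String word : text.split(" ")) {
            int endOffset = startOffset + word.length();
            result.add(new int[] { startOffset, endOffset });
            startOffset = endOffset + 1; // skip the single separating space
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "hello solr world";
        for (int[] o : offsets(text)) {
            // each recorded span reproduces the original word
            System.out.println(text.substring(o[0], o[1]));
        }
    }
}
```

Running this prints hello, solr, world on separate lines; if endOffset were computed as word.length() alone, the second span would come out as (6, 5) and substring would fail.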


Many Thanks,
Dhaivat


On Mon, Aug 12, 2013 at 7:03 PM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:

> Have you tried looking at the source code itself? Between the simple
> tokenizers like the keyword one and the complex language-specific ones, you
> should be able to get an idea. Then ask specific follow-up questions.
>
> Regards,
>      Alex
> On 12 Aug 2013 09:29, "dhaivat dave" <dh...@gmail.com> wrote:
>
> > Hello All,
> >
> > I want to create a custom tokeniser in Solr 4.4. It would be very
> > helpful if someone could share any tutorials or information on this.
> >
> >
> > Many Thanks,
> > Dhaivat Dave
> >
>



--
Regards
Dhaivat

Re: developing custom tokenizer

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Have you tried looking at the source code itself? Between the simple
tokenizers like the keyword one and the complex language-specific ones, you
should be able to get an idea. Then ask specific follow-up questions.

Regards,
     Alex
On 12 Aug 2013 09:29, "dhaivat dave" <dh...@gmail.com> wrote:

> Hello All,
>
> I want to create a custom tokeniser in Solr 4.4. It would be very helpful
> if someone could share any tutorials or information on this.
>
>
> Many Thanks,
> Dhaivat Dave
>