Posted to java-user@lucene.apache.org by Markus Wiederkehr <ma...@gmail.com> on 2005/06/13 13:08:04 UTC
Hyphenated word
Hello,
I work on an application that has to index OCR texts of scanned books.
Naturally, many words in them are hyphenated across line breaks.
I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses something like that at http://www.lucenebook.com/.
Thanks in advance,
Markus
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Hyphenated word
Posted by Andy Roberts <ma...@andy-roberts.net>.
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <ma...@andy-roberts.net> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus
Perhaps you could do some crude checking against a dictionary. Combine the
word anyway and check whether it's in the dictionary. If so, keep it merged;
otherwise it's a compound, so revert to the hyphenated form.
Word lists come as part of all good OSS dictionary projects, as well as other
language resources, such as the BNC word lists.
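The dictionary check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name DictionaryJoiner and its join method are hypothetical, the word list is assumed to be lowercase and already loaded into a Set, and nothing here is tied to Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a line-final hyphen should be dropped. */
final class DictionaryJoiner {
    private final Set<String> dictionary; // assumed to hold lowercase words

    DictionaryJoiner(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /**
     * head is the line-final fragment including its hyphen (e.g. "work-"),
     * tail is the first word of the next line (e.g. "ing").
     */
    String join(String head, String tail) {
        String merged = head.substring(0, head.length() - 1) + tail;
        // Known word: the hyphen was only a line break. Otherwise keep the compound.
        return dictionary.contains(merged.toLowerCase(Locale.ROOT)) ? merged : head + tail;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("working", "read", "only"));
        DictionaryJoiner joiner = new DictionaryJoiner(words);
        System.out.println(joiner.join("work-", "ing"));  // working
        System.out.println(joiner.join("read-", "only")); // read-only
    }
}
```

A check like this cannot settle cases where both forms are real words (for example "re-sign" and "resign"), so some list of exceptions would probably still be needed.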
Andy
Re: Hyphenated word
Posted by Markus Wiederkehr <ma...@gmail.com>.
On 6/13/05, Andy Roberts <ma...@andy-roberts.net> wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > I see, the list of exceptions makes this a lot more complicated than I
> > thought... Thanks a lot, Erik!
>
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.
As Erik wrote, it is not that simple, unfortunately. For example, if
one line ends with "read-" and the next line begins with "only", the
correct word is "read-only", not "readonly", whereas "work-" and "ing"
should of course be merged into "working".
Markus
Re: Hyphenated word
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
>> I see, the list of exceptions makes this a lot more complicated than I
>> thought... Thanks a lot, Erik!
>
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.
The problem I encountered when indexing "Lucene in Action" was that I
couldn't blindly concatenate two tokens just because the first ended
with a hyphen. Some lines ended with a hyphen because it was a dash,
not a hyphenated word.
I'm sure other, more clever implementations could do this better, for
instance by looking up the concatenated word in a dictionary.
Erik
Re: Hyphenated word
Posted by Andy Roberts <ma...@andy-roberts.net>.
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>
I expect you'll need to do some pre-processing. Read your text into a
buffer, line by line. If a given line ends with a hyphen, you can manipulate
the buffer to merge the hyphenated tokens.
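As a rough illustration of that pre-processing step, the sketch below joins every line-final hyphenated fragment with the first word of the next line. The class and method names are made up for this example, and the merge is deliberately naive: it also turns "read-" plus "only" into "readonly", which is exactly the weakness raised elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-processor: naive merge of words hyphenated across lines. */
final class NaiveDehyphenator {

    /** Returns the lines with each trailing "-" fragment moved onto the next line and joined. */
    static List<String> merge(List<String> lines) {
        List<String> out = new ArrayList<>();
        String carry = ""; // fragment held back from the previous line, without its hyphen
        for (String line : lines) {
            String current = carry + line.trim();
            carry = "";
            if (current.endsWith("-")) {
                int lastSpace = current.lastIndexOf(' ');
                // Hold back the last fragment (minus the hyphen) for the next line.
                carry = current.substring(lastSpace + 1, current.length() - 1);
                current = lastSpace < 0 ? "" : current.substring(0, lastSpace);
            }
            if (!current.isEmpty()) {
                out.add(current);
            }
        }
        if (!carry.isEmpty()) {
            out.add(carry + "-"); // the text ended on a hyphen: keep the fragment as it was
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the index is work-", "ing but read-", "only");
        System.out.println(merge(lines)); // [the index is, working but, readonly]
    }
}
```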
Andy
Re: Hyphenated word
Posted by "Peter A. Friend" <oc...@corp.earthlink.net>.
On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
There is a section about the problems that hyphens create in
"Foundations of Statistical Natural Language Processing". Not only
are the cases numerous, but seemingly simple rules, such as joining
hyphenated forms at the ends of lines, do not always work. Sometimes
the hyphen was added to break the word; sometimes you are already
dealing with a hyphenated form that just happened to occur at the end
of a line, so the hyphen serves two purposes. I've toyed with the
idea of indexing hyphenated words in their raw as well as split
forms, but I think that would wreak havoc on the word positions
and bloat the index with potentially meaningless gibberish.
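To make the position concern concrete, here is a small Lucene-free sketch of indexing both forms. Everything in it is an assumption for illustration: Tok is a hypothetical stand-in for an analyzer token carrying a term and a position increment, and the scheme (raw form first, first part stacked on the same position, later parts advancing by one) is just one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: emit a hyphenated word in raw and split form. */
final class StackedHyphenTokens {

    /** Stand-in for an analyzer token: a term plus its position increment. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1)); // the raw form always advances one position
            String[] parts = word.split("-");
            if (parts.length < 2) {
                continue; // nothing to split
            }
            boolean first = true;
            for (String part : parts) {
                if (part.isEmpty()) {
                    continue; // skip empty pieces from doubled hyphens
                }
                // First part shares the raw form's position, later parts move on.
                out.add(new Tok(part, first ? 0 : 1));
                first = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("a", "read-only", "index")));
        // [a(+1), read-only(+1), read(+0), only(+1), index(+1)]
    }
}
```

With this scheme "read-only" sits at position 2 and "index" at position 4, so the raw form and the following word are no longer adjacent. That is one way the split forms can disturb phrase matching, and every hyphenated word adds extra terms to the index.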
Peter
Re: Hyphenated word
Posted by Markus Wiederkehr <ma...@gmail.com>.
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!
Markus
--
Always remember you're unique. Just like everyone else.
Re: Hyphenated word
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> I work on an application that has to index OCR texts of scanned books.
> Naturally there occur many words that are hyphenated across lines.
>
> I wonder if there is already an Analyzer or maybe a TokenFilter that
> can merge those syllables back into whole words? It looks like Erik
> Hatcher uses something like that at http://www.lucenebook.com/.
Markus - you're right, I did develop something to handle hyphenated
words for lucenebook.com. It was something of a hack: I had to build in
a static list of exceptions, so you'll likely have to use caution as
well. The LiaAnalyzer is this:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenFilter filter = new DashSplitterFilter(
            new HyphenatedFilter(
                new DashDashFilter(
                    new LiaTokenizer(reader))));

        filter = new LengthFilter(3, filter);
        filter = new StopFilter(filter, stopSet);

        if (stem) {
            filter = new SnowballFilter(filter, "English");
        }

        return filter;
    }
And my HyphenatedFilter is this:
    import java.io.IOException;
    import java.util.HashMap;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class HyphenatedFilter extends TokenFilter {
        private HashMap exceptions = new HashMap();

        // Compounds that must keep their hyphen when split across lines.
        private static final String[] EXCEPTION_LIST = {
            "full-text", "information-retrieval", "license-code", "old-fashioned",
            "well-designed", "free-form", "file-based", "ramdirectory-based", "ram-based",
            "index-modifying", "read-only",
            "top-scoring", "most-recently-used", "queryparser-parsed",
            "in-order", "per-document", "lower-caser", "domain-specific", "high-level",
            "utf-encoding", "non-english", "phraseprefix-it", "all-inclusive",
            "date-range", "computation-intensive", "hits-returning", "lower-level",
            "number-padding", "utf-address-book", "third-party", "plain-text", "google-like",
            "re-add", "english-specific", "file-handling", "already-created", "d-add", "d-add",
            "hits-length", "hits-doc", "hits-score", "d-get", "writer-new", "porteranalyzer-new",
            "writer-set", "document-new", "doc-add", "field-keyword", "field-unstored", "writer-add",
            "writer-optimize", "queryparser-new", "porteranalyzer-new", "parser-parse", "indexsearcher-new",
            "hitcollector-new", "searcher-doc", "searcher-search", "jakarta-lucene", "www-ibm", "java-specific",
            "non-java", "vis--vis", "medium-sized", "browser-based", "utf-before", "concept-based",
            "natural-language", "queue-based", "high-likelihood", "slp-or", "noisy-channel", "al-rasheed",
            "hands-free", "top-notch", "google-esque", "search-config", "java-related",
            "lucene-so", "lucene-tar", "lucene-jar", "lucene-demos-jar", "lucene-web", "lucene-webindex",
            "command-line", "lucene-version", "issue-tracking"
        };

        protected HyphenatedFilter(TokenStream tokenStream) {
            super(tokenStream);

            for (int i = 0; i < EXCEPTION_LIST.length; i++) {
                exceptions.put(EXCEPTION_LIST[i], "");
            }
        }

        // Second half of an exception pair, held back for the next call.
        private Token savedToken;

        public Token next() throws IOException {
            if (savedToken != null) {
                Token token = savedToken;
                savedToken = null;
                return token;
            }

            Token firstToken = input.next();

            if (firstToken == null)
                return firstToken;

            if (firstToken.termText().endsWith("-")) {
                String firstPart;
                firstPart = firstToken.termText();

                // consume next token
                Token secondToken = input.next();
                if (secondToken == null)
                    return firstToken;

                // Merged form: first part without its trailing hyphen, plus the second part.
                String termText = firstPart.substring(0, firstPart.length() - 1)
                    + secondToken.termText();

                // Known compound: emit both tokens unmerged.
                if (exceptions.containsKey(firstPart + secondToken.termText())) {
                    savedToken = secondToken;
                    return firstToken;
                }

                return new Token(termText, firstToken.startOffset(),
                    firstToken.endOffset() + secondToken.termText().length() + 1);
            }

            return firstToken;
        }
    }
Not all that pretty, I'm afraid, but by all means use it if it's useful.
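For readers without the Lucene classes at hand, the control flow of the filter above can be tried out on plain strings. The following is only a sketch under assumptions, not the code used for lucenebook.com: the class HyphenMergeDemo and its merge method are hypothetical, tokens are simple strings, and offsets are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, Lucene-free illustration of merge-unless-exception. */
final class HyphenMergeDemo {

    static List<String> merge(List<String> tokens, Set<String> exceptions) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String first = tokens.get(i);
            if (!first.endsWith("-") || i + 1 == tokens.size()) {
                out.add(first); // ordinary token, or nothing left to merge with
                continue;
            }
            String second = tokens.get(i + 1);
            if (exceptions.contains(first + second)) {
                // Known compound such as "read-only": emit both parts untouched.
                out.add(first);
                out.add(second);
            } else {
                // Drop the trailing hyphen and glue the two parts together.
                out.add(first.substring(0, first.length() - 1) + second);
            }
            i++; // the second token has been consumed either way
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> exceptions = new HashSet<>(Arrays.asList("read-only"));
        List<String> tokens = Arrays.asList("read-", "only", "work-", "ing", "index");
        System.out.println(merge(tokens, exceptions)); // [read-, only, working, index]
    }
}
```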
Erik