Posted to java-user@lucene.apache.org by Markus Wiederkehr <ma...@gmail.com> on 2005/06/13 13:08:04 UTC

Hyphenated word

Hello,

I work on an application that has to index OCR texts of scanned books.
Naturally, many words end up hyphenated across line breaks.

I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses something like that at http://www.lucenebook.com/.

Thanks in advance,

Markus



Re: Hyphenated word

Posted by Andy Roberts <ma...@andy-roberts.net>.
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <ma...@andy-roberts.net> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect you'll need to do some pre-processing. Read in your text into a
> > buffer, line-by-line. If a given line ends with a hyphen, you can
> > manipulate the buffer to merge the hyphenated tokens.
>
> As Erik wrote it is not that simple, unfortunately. For example, if
> one line ends with "read-" and the next line begins with "only" the
> correct word is "read-only" not "readonly". Whereas "work-" and "ing"
> should of course be merged into "working".
>
> Markus

Perhaps you could do some crude checking against a dictionary: combine the two 
parts anyway and check whether the result is in the dictionary. If so, keep it 
merged; otherwise it's a compound, so revert to the hyphenated form.

Word lists come as part of all good OSS dictionary projects, as well as other 
language resources such as the BNC word lists.
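For what it's worth, here is a minimal sketch of that kind of dictionary check 
in Java (not from this thread): it assumes a plain one-word-per-line word list 
on disk, and the class and method names below are made up purely for 
illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DehyphenationChecker {

   private final Set words = new HashSet();

   // load one lower-cased word per line from a word list (ispell, BNC, ...)
   public DehyphenationChecker(String wordListFile) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(wordListFile));
      for (String line = in.readLine(); line != null; line = in.readLine()) {
         words.add(line.trim().toLowerCase());
      }
      in.close();
   }

   // firstPart still carries its trailing hyphen, e.g. "work-" or "read-"
   public String join(String firstPart, String secondPart) {
      String merged = firstPart.substring(0, firstPart.length() - 1) + secondPart;
      if (words.contains(merged.toLowerCase())) {
         return merged;                    // "work-" + "ing"  -> "working"
      }
      return firstPart + secondPart;       // "read-" + "only" -> "read-only"
   }
}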

Andy



Re: Hyphenated word

Posted by Markus Wiederkehr <ma...@gmail.com>.
On 6/13/05, Andy Roberts <ma...@andy-roberts.net> wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > I see, the list of exceptions makes this a lot more complicated than I
> > thought... Thanks a lot, Erik!
> >
> 
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.

As Erik wrote it is not that simple, unfortunately. For example, if
one line ends with "read-" and the next line begins with "only" the
correct word is "read-only" not "readonly". Whereas "work-" and "ing"
should of course be merged into "working".

Markus



Re: Hyphenated word

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:

> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
>
>> I see, the list of exceptions makes this a lot more complicated than I
>> thought... Thanks a lot, Erik!
>>
>>
>
> I expect you'll need to do some pre-processing. Read in your text into a
> buffer, line-by-line. If a given line ends with a hyphen, you can manipulate
> the buffer to merge the hyphenated tokens.

The problem I encountered when indexing "Lucene in Action" was that I
couldn't blindly concatenate two tokens just because the first one ends
with a hyphen. Some lines ended with a hyphen because it was a dash,
not a hyphenated word.

I'm sure more clever implementations could do this better, for instance
by looking up the concatenated word in a dictionary.

     Erik




Re: Hyphenated word

Posted by Andy Roberts <ma...@andy-roberts.net>.
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>

I expect you'll need to do some pre-processing. Read in your text into a 
buffer, line-by-line. If a given line ends with a hyphen, you can manipulate 
the buffer to merge the hyphenated tokens.
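To make that concrete, here is a rough sketch of the buffer manipulation, 
assuming plain-text input; the class name is invented, and the always-merge 
policy is deliberately naive (a dictionary or exception check, as discussed 
elsewhere in this thread, would go there instead).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class HyphenJoiner {

   // reads the text line by line; when a line ends with "-", the first
   // token of the next line is pulled up and glued onto it
   public static String preprocess(String fileName) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(fileName));
      StringBuffer out = new StringBuffer();
      String line = in.readLine();
      while (line != null) {
         String next = in.readLine();
         if (line.endsWith("-") && next != null) {
            int space = next.indexOf(' ');
            String head = (space < 0) ? next : next.substring(0, space);
            String rest = (space < 0) ? "" : next.substring(space + 1);
            // naive policy: always drop the hyphen; a dictionary check would
            // decide between e.g. "working" and "read-only" instead
            out.append(line.substring(0, line.length() - 1)).append(head).append('\n');
            line = rest;
         } else {
            out.append(line).append('\n');
            line = next;
         }
      }
      in.close();
      return out.toString();
   }
}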

Andy




Re: Hyphenated word

Posted by "Peter A. Friend" <oc...@corp.earthlink.net>.
On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:

> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!

There is a section on the problems hyphens create in "Foundations of
Statistical Natural Language Processing". Not only are the cases numerous,
but seemingly simple rules such as joining hyphenated forms at the ends of
lines do not always work. Sometimes the hyphen was added just to break the
word at the line end; sometimes you are dealing with a form that is genuinely
hyphenated and just happened to fall at the end of a line, so the hyphen
serves two purposes. I've toyed with the idea of indexing hyphenated words in
both their raw and their split forms, but I think that would wreak havoc on
the word position information, as well as bloat the index with potentially
meaningless gibberish.
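For what it's worth, the position problem can be sidestepped the same way
synonyms usually are: emit the extra form with a position increment of zero,
so it stacks on the same position as the original token. Below is a minimal,
untested sketch against the same pre-2.0 Token API that Erik's HyphenatedFilter
uses later in this thread; the filter name is made up, and it emits a
de-hyphenated variant rather than fully split parts.

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// For every token containing an internal hyphen, also emits the
// de-hyphenated form at the same position (position increment 0),
// so phrase queries and term positions are left undisturbed.
public class HyphenVariantsFilter extends TokenFilter {

   private Token pending;   // extra variant waiting to be returned

   public HyphenVariantsFilter(TokenStream in) {
      super(in);
   }

   public Token next() throws IOException {
      if (pending != null) {
         Token t = pending;
         pending = null;
         return t;
      }

      Token token = input.next();
      if (token == null) return null;

      String text = token.termText();
      int dash = text.indexOf('-');
      if (dash > 0 && dash < text.length() - 1) {
         Token merged = new Token(text.replaceAll("-", ""),
                                  token.startOffset(), token.endOffset());
         merged.setPositionIncrement(0);   // stack on the original's position
         pending = merged;
      }
      return token;
   }
}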

Peter




Re: Hyphenated word

Posted by Markus Wiederkehr <ma...@gmail.com>.
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!

Markus

On 6/13/05, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> 
> On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> > I work on an application that has to index OCR texts of scanned books.
> > Naturally, many words end up hyphenated across line breaks.
> >
> > I wonder if there is already an Analyzer or maybe a TokenFilter that
> > can merge those syllables back into whole words? It looks like Erik
> > Hatcher uses something like that at http://www.lucenebook.com/.
> 
> Markus - you're right, I did develop something to handle hyphenated
> words for lucenebook.com.  It was sort of a hack in that I had to
> build in a static list of exceptions in how I handled this, so you'll
> likely have to use caution as well.
> 
> Not all that pretty, I'm afraid, but by all means use it if it's useful.
> 
>      Erik


-- 
Always remember you're unique. Just like everyone else.



Re: Hyphenated word

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> I work on an application that has to index OCR texts of scanned books.
> Naturally, many words end up hyphenated across line breaks.
>
> I wonder if there is already an Analyzer or maybe a TokenFilter that
> can merge those syllables back into whole words? It looks like Erik
> Hatcher uses something like that at http://www.lucenebook.com/.

Markus - you're right, I did develop something to handle hyphenated  
words for lucenebook.com.  It was sort of a hack in that I had to  
build in a static list of exceptions in how I handled this, so you'll  
likely have to use caution as well.  The LiaAnalyzer is this:

   public TokenStream tokenStream(String fieldName, Reader reader) {
     TokenFilter filter = new DashSplitterFilter(
               new HyphenatedFilter(
                 new DashDashFilter(
                   new LiaTokenizer(reader))));

     filter = new LengthFilter(3, filter);
     filter = new StopFilter(filter, stopSet);

     if (stem) {
       filter = new SnowballFilter(filter, "English");
     }

     return filter;
   }


And my HyphenatedFilter is this:

public class HyphenatedFilter extends TokenFilter {
   private HashMap exceptions = new HashMap();

   private static final String[] EXCEPTION_LIST = {
      "full-text", "information-retrieval", "license-code", "old- 
fashioned",
      "well-designed", "free-form", "file-based", "ramdirectory- 
based", "ram-based",
      "index-modifying", "read-only",
      "top-scoring", "most-recently-used", "queryparser-parsed",
      "in-order", "per-document", "lower-caser", "domain-specific",  
"high-level",
      "utf-encoding", "non-english", "phraseprefix-it", "all-inclusive",
      "date-range", "computation-intensive", "hits-returning", "lower- 
level",
      "number-padding", "utf-address-book", "third-party", "plain- 
text", "google-like",
      "re-add", "english-specific", "file-handling", "already- 
created", "d-add", "d-add",
      "hits-length", "hits-doc", "hits-score", "d-get", "writer-new",  
"porteranalyzer-new",
      "writer-set", "document-new", "doc-add", "field-keyword",  
"field-unstored", "writer-add",
      "writer-optimize", "queryparser-new", "porteranalyzer-new",  
"parser-parse", "indexsearcher-new",
      "hitcollector-new", "searcher-doc", "searcher-search", "jakarta- 
lucene", "www-ibm", "java-specific",
      "non-java", "vis--vis", "medium-sized", "browser-based", "utf- 
before", "concept-based",
      "natural-language", "queue-based", "high-likelihood", "slp-or",  
"noisy-channel", "al-rasheed",
      "hands-free", "top-notch", "google-esque", "search-config",  
"java-related",
      "lucene-so", "lucene-tar", "lucene-jar", "lucene-demos-jar",  
"lucene-web", "lucene-webindex",
      "command-line", "lucene-version", "issue-tracking"
   };

   protected HyphenatedFilter(TokenStream tokenStream) {
     super(tokenStream);

     for (int i = 0; i < EXCEPTION_LIST.length; i++) {
       exceptions.put(EXCEPTION_LIST[i], "");
     }
   }

   // holds the pending second token when an exception keeps a pair un-merged
   private Token savedToken;

   public Token next() throws IOException {

     if (savedToken != null) {
       Token token = savedToken;
       savedToken = null;
       return token;
     }

     Token firstToken = input.next();

     if (firstToken == null)
       return firstToken;


     if (firstToken.termText().endsWith("-")) {
       String firstPart;
       firstPart = firstToken.termText();

       // consume next token
       Token secondToken = input.next();
       if (secondToken == null)
         return firstToken;

       String termText = firstPart.substring(0, firstPart.length() - 1)
           + secondToken.termText();

       if (exceptions.containsKey(firstPart + secondToken.termText())) {
         savedToken = secondToken;
         return firstToken;
       }

       return new Token(termText, firstToken.startOffset(),
           firstToken.endOffset() + secondToken.termText().length() + 1);
     }

     return firstToken;
   }
}

Not all that pretty, I'm afraid, but by all means use it if it's useful.

     Erik

