Posted to dev@tika.apache.org by Margi Patel <ma...@usc.edu> on 2014/03/16 19:36:53 UTC

Use of Levenshtein distance to find similar words

Hello Professor Mattmann,

I have completed the basic requirements of the TIKA assignment (without the
OCR quality check) and now I want to go for the extra edit-distance part. I
plan to use the Levenshtein distance implementation in Apache's
commons-lang3-3.1.jar.

I tried the following:
---------------------------
After I extract all of the text from each PDF file, I need to find the
Levenshtein distance between each keyword in my set of 11 keywords and the
extracted text.
Since the extracted text is one very long string, I thought of splitting it
on the newline character ("\n"). For each line, I compute the edit distance,
keeping the threshold very low.
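
Roughly, the code I have in mind looks like the sketch below (the sample
text, keyword list, and threshold value are just placeholders, not my actual
data):

import org.apache.commons.lang3.StringUtils;

import java.util.Arrays;
import java.util.List;

public class LineDistanceSketch {
    // Very low threshold, as described above.
    private static final int THRESHOLD = 2;

    public static void main(String[] args) {
        // Placeholder for the text Tika extracts from one PDF.
        String extractedText = "Polar bears live on sea ice\nThe qu1ck brown f0x\nClimate change";
        // Placeholder keywords; the real set has 11 of them.
        List<String> keywords = Arrays.asList("polar", "climate");

        for (String line : extractedText.split("\n")) {
            for (String keyword : keywords) {
                // commons-lang3 returns -1 as soon as the distance exceeds the threshold.
                int d = StringUtils.getLevenshteinDistance(
                        line.toLowerCase(), keyword.toLowerCase(), THRESHOLD);
                if (d >= 0) {
                    System.out.println("'" + keyword + "' ~ line \"" + line
                            + "\" (distance " + d + ")");
                }
            }
        }
    }
}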

However, this does not seem to be the correct approach, since the extracted
text contains a good amount of junk characters due to OCR noise and errors.
I need to do some pre-processing on the extracted text first.
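
For the pre-processing, my first guess would be something like the sketch
below (which characters to strip is just an assumption on my part):

public class OcrCleaner {
    public static String clean(String rawText) {
        return rawText
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ") // drop anything that is not a letter, digit, or whitespace
                .replaceAll("\\s+", " ")                // collapse runs of whitespace into single spaces
                .trim()
                .toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("P0lar   bears,, ice!!")); // -> "p0lar bears ice"
    }
}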

Pointers in the right direction/approach would greatly help.

Thanks!
-Margi




Re: Use of Levenshtein distance to find similar words

Posted by Chris Mattmann <ma...@apache.org>.
Dear Margi,

Great question and thanks for posting this to the list! :)

You may also want to split your extracted text not just on "\n" but
also on " " (whitespace) to canonicalize the words. You may even think of an
approach for creating candidate words (recall we discussed a method in class
for considering N-grams). This should make your process of using
commons-lang and edit distance easier. For the OCR help, check out TIKA-93
[1] and the work going on there.
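
Something along these lines might be a starting point (the word-level
N-grams, the single-space join, and the distance threshold here are rough
assumptions, not a prescription):

import org.apache.commons.lang3.StringUtils;

import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

    // Split on any whitespace and lower-case each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    // Build word N-grams of the given size, joined with single spaces.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(StringUtils.join(tokens.subList(i, i + n), ' '));
        }
        return grams;
    }

    public static void main(String[] args) {
        String extracted = "P0lar  bears live on   sea ice";
        String keyword = "polar bears";           // hypothetical two-word keyword
        int n = keyword.split("\\s+").length;     // match keyword length in words

        for (String gram : ngrams(tokenize(extracted), n)) {
            int d = StringUtils.getLevenshteinDistance(gram, keyword);
            if (d <= 2) {                          // assumed threshold
                System.out.println("Close match: '" + gram + "' (distance " + d + ")");
            }
        }
    }
}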

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-93




