Posted to java-user@lucene.apache.org by Jonathan Ciampi <jo...@yahoo.com> on 2010/09/30 22:58:26 UTC

Looking for advice on using Lucene to semantically compare two documents

Advice on comparing two documents.
 
Summary
This project is not a search engine; it is a semantic comparison between two
documents.  The purpose of the application is to help users modify the text of
one document to improve its relevancy ranking against another document.  For
example, the user would compare Document A to Document B to identify the text
in Document A that is relevant to Document B.  The user would then need to see
which text to modify to improve the relevancy rating.
 
 
Description: 
 
Both documents are XML with tags identifying the keywords or blocks of text in 
the document.  

 
Sample Structure
 
Document A
<DocumentName>DocumentA</DocumentName>
<Keyword>This is keyword 1</Keyword> 
<Keyword>Keywords can be any length</Keyword> 
<Keyword>Some keywords will match Document B</Keyword> 
<Keyword>Some keywords will not match</Keyword> 
<Keyword>Keywords can contain text, numbers, and symbols</Keyword> 
 
Document B
<DocumentName>DocumentB</DocumentName>
<Keyword>This is Document B keyword 1</Keyword> 
<Keyword>Document B serves as the basis or standard for comparing</Keyword> 
<Keyword>Document A will be modified by the user to match the keywords in Document B</Keyword>
<Keyword>Document A and Document B will always be compared to each other</Keyword>
<Keyword>This application is to help users add text, numbers and symbols to improve their relevancy ranking</Keyword>
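
As a rough illustration of the structure above, the keywords could be pulled out of each document before any comparison is done. This is only a sketch: the class and method names (KeywordExtractor, extract) are hypothetical, and it uses a regular expression rather than an XML parser on the assumption that the documents look exactly like the samples, which have no single root element and so would be rejected by a strict parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: collects the text of every
// <Keyword> element from a document shaped like the samples.
final class KeywordExtractor {
    // DOTALL lets a keyword span a line break, as some do in the samples.
    private static final Pattern KEYWORD =
            Pattern.compile("<Keyword>(.*?)</Keyword>", Pattern.DOTALL);

    static List<String> extract(String document) {
        List<String> keywords = new ArrayList<>();
        Matcher m = KEYWORD.matcher(document);
        while (m.find()) {
            // Collapse line breaks and runs of spaces into single spaces.
            keywords.add(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return keywords;
    }

    public static void main(String[] args) {
        String documentA = "<DocumentName>DocumentA</DocumentName>\n"
                + "<Keyword>This is keyword 1</Keyword>\n"
                + "<Keyword>Keywords can be\nany length</Keyword>\n";
        for (String keyword : extract(documentA)) {
            System.out.println(keyword);
        }
    }
}
```

If the real documents are wrapped in a root element, a standard XML parser would be the safer choice than the regular expression assumed here.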

 
We believe we need to use Lucene to run semantic searches that determine
relevance.  Our preferred output would show the user the keywords from each
document along with their relevancy.  To remove excessive data, the output
would show all keywords from Document B, but only those keywords from
Document A that have a relevancy ranking.
 
Sample Output
 
Document B | Document A | Relevancy
This is Document B keyword 1 | This is keyword 1 | .25
This is Document B keyword 1 | Keywords can be any length | .25
This is Document B keyword 1 | Some keywords will match Document B | .25
This is Document B keyword 1 | Some keywords will not match | .25
This is Document B keyword 1 | Keywords can contain text, numbers, and symbols | .25
Document B serves as the basis or standard for comparing | Some keywords will match Document B | .5
Document A will be modified by the user to match the keywords in Document B | This is keyword 1 | .1
Document A will be modified by the user to match the keywords in Document B | Keywords can be any length | .1
Document A will be modified by the user to match the keywords in Document B | Some keywords will not match | .1
Document A will be modified by the user to match the keywords in Document B | Some keywords will match Document B | .75
Document A will be modified by the user to match the keywords in Document B | Keywords can contain text, numbers, and symbols | .1
This application is to help users add text, numbers and symbols to improve their relevancy ranking | Keywords can contain text, numbers, and symbols | .9
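
To make the shape of the desired output concrete, here is a minimal sketch that scores every Document B keyword against every Document A keyword. It is not Lucene's scoring and it is not semantic: it assumes plain cosine similarity over term counts (lexical overlap only), so the numbers it prints will not match the sample figures above. The names PairwiseRelevancy, termCounts, and cosine are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a relevancy score: cosine similarity of
// term-frequency vectors. It measures word overlap, not meaning.
final class PairwiseRelevancy {
    // Lower-case the text and count each run of letters or digits.
    // Assumption: symbols are treated as separators and dropped.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    static double cosine(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = cb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : cb.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty keyword has no overlap with anything
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<String> docB = Arrays.asList(
                "This is Document B keyword 1",
                "Document B serves as the basis or standard for comparing");
        List<String> docA = Arrays.asList(
                "This is keyword 1",
                "Some keywords will match Document B",
                "Some keywords will not match");
        System.out.println("Document B | Document A | Relevancy");
        for (String b : docB) {
            for (String a : docA) {
                double score = cosine(b, a);
                // Show only Document A keywords that have some relevancy.
                if (score > 0) {
                    System.out.printf("%s | %s | %.2f%n", b, a, score);
                }
            }
        }
    }
}
```

A real implementation would presumably replace cosine with whatever score Lucene produces for the pair; the nested loop and the filter on non-zero scores are the part that mirrors the requested output.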

      
  Jon Ciampi
Mobile (415) 990-3151