You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Jonathan Ciampi <jo...@yahoo.com> on 2010/09/30 22:58:26 UTC
Looking for advice on using Lucene to semantically compare two documents
Advice on comparing two documents.
Summary
This project is not a search engine but a semantic comparison between two
documents. The purpose of this application is to assist users in modifying the
text in a document to improve the relevancy rank of the document to another
document. For example, the user would want to compare Document A to Document B
to identify the text in Document A that has relevancy to Document B. Then, the
user would want the ability to identify the text to modify to improve the
relevancy rating.
Description:
Both documents are XML with tags identifying the keywords or blocks of text in
the document.
Sample Structure
Document A
<DocumentName>DocumentA</DocumentName>
<Keyword>This is keyword 1</Keyword>
<Keyword>Keywords can be any length</Keyword>
<Keyword>Some keywords will match Document B</Keyword>
<Keyword>Some keywords will not match</Keyword>
<Keyword>Keywords can contain text, numbers, and symbols</Keyword>
Document B
<DocumentName>DocumentB</DocumentName>
<Keyword>This is Document B keyword 1</Keyword>
<Keyword>Document B serves as the basis or standard for comparing</Keyword>
<Keyword>Document A will be modified by the user to match the keywords in
Document B</Keyword>
<Keyword>Document A and Document B will always be compared to each
other</Keyword>
<Keyword>This application is to help users add text, numbers and symbols to
improve their relevancy ranking</Keyword>
We believe we need to use Lucene to do semantic searches to determine
relevance. Our preferred output would be to show a user the words from each
document with their relevancy. To remove excessive data, the output would show
all keywords from Document B, and only those with a relevancy ranking from
Document A.
Sample Output
Document B Document A Relevancy
This is Document B keyword 1 This is keyword 1 .25
This is Document B keyword 1 Keywords can be any length .25
This is Document B keyword 1 Some keywords will match Document B .25
This is Document B keyword 1 Some keywords will not match .25
This is Document B keyword 1 Keywords can contain text, numbers, and symbols .25
Document B serves as the basis or standard for comparing Some keywords will
match Document B .5
Document A will be modified by the user to match the keywords in Document B
This is keyword 1 .1
Document A will be modified by the user to match the keywords in Document B
Keywords can be any length .1
Document A will be modified by the user to match the keywords in Document B
Some keywords will not match .1
Document A will be modified by the user to match the keywords in Document B
Some keywords will match Document B .75
Document A will be modified by the user to match the keywords in Document B
Keywords can contain text, numbers, and symbols .1
This application is to help users add text, numbers and symbols to improve
their relevancy ranking Keywords can contain text, numbers, and symbols .9
Jon Ciampi
Mobile (415) 990-3151