Posted to dev@lucene.apache.org by Ali Nazemian <al...@gmail.com> on 2015/07/17 09:28:09 UTC

Extracting article keywords using tf-idf algorithm

Dear Lucene/Solr developers,
Hi,
I decided to develop a plugin for Solr to extract the main keywords
from an article. Since Solr has already done the hard work of calculating
tf-idf scores, I decided to reuse those for the sake of better performance. I
know that UpdateRequestProcessor is the best-suited extension point for
adding a keyword value to documents, but I also found out that I have no
access to tf-idf scores inside an UpdateRequestProcessor, because the
UpdateRequestProcessor chain is applied before tf-idf scores are
calculated. Hence, after consulting with Solr/Lucene developers, I decided
to go with a SearchComponent that calculates keywords based on tf-idf
(Lucene's "interesting terms") on commit/optimize.
Unfortunately, with this approach I observed strange core behavior. For
example, sometimes faceting won't work on the keyword field, or the index
becomes unstable in search results.
I would really appreciate it if someone could help me make it stable.


NamedList<Object> response = new SimpleOrderedMap<>();
    keyword.init(searcher, params);
    // Select docs that still carry the "noval" placeholder in the keyword
    // field but in none of the source fields.
    BooleanQuery query = new BooleanQuery();
    for (String fieldName : keywordSourceFields) {
      query.add(new TermQuery(new Term(fieldName, "noval")), Occur.MUST_NOT);
    }
    query.add(new TermQuery(new Term(keywordField, "noval")), Occur.MUST);
    RefCounted<IndexWriter> iw = null;
    try {
      TopDocs results = searcher.search(query, maxNumDocs);
      ScoreDoc[] hits = results.scoreDocs;
      iw = solrCoreState.getIndexWriter(core);
      IndexWriter writer = iw.get();
      FieldType type = new FieldType(StringField.TYPE_STORED);
      String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
      for (int i = 0; i < hits.length; i++) {
        Document document = searcher.doc(hits[i].doc);
        List<String> keywords = keyword.getKeywords(hits[i].doc);
        if (!keywords.isEmpty()) document.removeFields(keywordField);
        for (String word : keywords) {
          document.add(new Field(keywordField, word, type));
        }
        // Replace the stored document in place, keyed on the unique key.
        writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)),
            document);
      }
      response.add("Number of Selected Docs", results.totalHits);
      writer.commit();
    } catch (IOException | SyntaxError e) {
      throw new RuntimeException(e); // keep the cause instead of swallowing it
    } finally {
      if (iw != null) {
        iw.decref();
      }
    }


public List<String> getKeywords(int docId) throws SyntaxError {
    // Configure MoreLikeThis to pull the top tf-idf terms for this doc.
    String[] fields = keywordSourceFields.toArray(new String[0]);
    mlt.setFieldNames(fields);
    mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
    mlt.setMinTermFreq(minTermFreq);
    mlt.setMinDocFreq(minDocFreq);
    mlt.setMinWordLen(minWordLen);
    mlt.setMaxQueryTerms(maxNumKeywords);
    mlt.setMaxNumTokensParsed(maxTokensParsed);
    try {
      return Arrays.asList(mlt.retrieveInterestingTerms(docId));
    } catch (IOException e) {
      LOGGER.error(e.getMessage(), e);
      throw new RuntimeException(e);
    }
  }
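For reference, the tf-idf ranking that MoreLikeThis's interesting terms are built on can be sketched in a few lines of plain Java. This is a toy illustration only, not Lucene's actual similarity (Lucene adds its own normalization, so the scores differ), but the idea of picking the top-k terms by tf-idf is the same:

```java
import java.util.*;
import java.util.stream.Collectors;

/** Minimal tf-idf sketch; illustrative only, not Lucene's scoring. */
public class TfIdfSketch {

    /** Classic weighting: tf-idf = tf * log(N / df). */
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    /** Rank the terms of one document against a tiny corpus and return
     *  the top-k by tf-idf, i.e. the "interesting terms". */
    public static List<String> topTerms(List<List<String>> corpus,
                                        int docId, int k) {
        int numDocs = corpus.size();
        // document frequency of each term across the corpus
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }
        // term frequency within the target document
        Map<String, Integer> tf = new HashMap<>();
        for (String t : corpus.get(docId)) tf.merge(t, 1, Integer::sum);
        // sort by tf-idf, highest first, and keep the top k
        return tf.keySet().stream()
            .sorted(Comparator.comparingDouble(
                (String t) -> tfIdf(tf.get(t), df.get(t), numDocs)).reversed())
            .limit(k)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("solr", "keyword", "keyword", "index"),
            Arrays.asList("solr", "index"),
            Arrays.asList("solr", "query"));
        // "keyword" is frequent in doc 0 but rare in the corpus, so it wins
        System.out.println(topTerms(corpus, 0, 2)); // prints [keyword, index]
    }
}
```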

Best regards.
-- 
A.Nazemian

Re: Extracting article keywords using tf-idf algorithm

Posted by Ali Nazemian <al...@gmail.com>.
Hi again,
It seems that my problem with the strange behavior of Solr was caused by
the fact that I updated documents and added the keyword field directly
through the Lucene index (rather than through the SolrJ API) for the sake
of better performance. Some processing is evidently skipped when the index
is modified this way (which, in hindsight, is obvious), and those processes
that I am not aware of are what cause the inconsistency.
One solution would be to update the index by re-adding each document
through SolrJ. As I mentioned, this is not ideal from a performance point
of view (the indexing time would roughly double). It would therefore be
nice if there is a reliable solution to my problem that also addresses the
performance concern.
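If the SolrJ route does turn out to be necessary, one way to soften the performance hit is to avoid re-sending the whole document and instead use Solr's atomic updates, which `set` only the keyword field (this requires a uniqueKey and the other fields to be stored or have docValues; the field names below are made-up examples):

```json
[
  { "id": "article-42",
    "keywords": { "set": ["solr", "lucene", "tf-idf"] } }
]
```

Because this payload goes through the normal /update handler and its processor chain, faceting and searching on the keyword field should stay consistent, unlike writes made directly through the IndexWriter.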

Best regards.





-- 
A.Nazemian

Re: Extracting article keywords using tf-idf algorithm

Posted by Ali Nazemian <al...@gmail.com>.
Dear Diego,
Hi,
Yeah, exactly what I want.
As Shawn said, it is an acronym for More Like This. Since Lucene has
already done the hard work of calculating the interesting terms, I just
want to use that to add a multi-valued keyword field to all indexed
documents.

Best regards.



-- 
A.Nazemian

Re: Extracting article keywords using tf-idf algorithm

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/18/2015 9:16 AM, Diego Ceccarelli wrote:
> Could you please post your code somewhere? I don't understand what is
> "mlt"  :)

This is an acronym that means "More Like This".

https://wiki.apache.org/solr/MoreLikeThis

Thanks,
Shawn


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Extracting article keywords using tf-idf algorithm

Posted by Diego Ceccarelli <di...@gmail.com>.
Dear Ali,

I'm not sure I understand what you are trying to do, please correct me if I
misunderstood:
given a document indexed into lucene you want to retrieve the top-k terms
with highest tf-idf right?

Could you please post your code somewhere? I don't understand what is
"mlt"  :)

Cheers,
Diego


