Posted to java-user@lucene.apache.org by "Witdouck, Xavier" <Xa...@blackrock.com> on 2014/01/24 17:23:23 UTC

Building term frequency matrix over 6 million documents...

Hi all,

We have over 6 million documents in our index and would like to construct a term frequency matrix over all of them as quickly as possible.  Each document has a numeric date field, so for each term we want a time series whose value on a given date is the sum of that term's frequencies across all documents with that date.  For example, if the term was "iPhone", we would want the total number of iPhone mentions, broken down into date buckets.
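To make the target output concrete, here is a minimal plain-Java sketch of one column of such a matrix: a sorted map from date to summed frequency. The class name, the method names and the yyyyMMdd integer encoding of the date are assumptions made for illustration; the post only says the date field is numeric.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: sums per-document term frequencies into date buckets. */
final class TermTimeSeries {
    // date -> summed frequency; a TreeMap keeps the buckets in date order.
    // Dates are assumed to be yyyyMMdd integers (an assumption, not from the post).
    private final TreeMap<Integer, Long> buckets = new TreeMap<>();

    /** Adds one document's frequency for the term to the bucket for its date. */
    void add(int date, int freq) {
        buckets.merge(date, (long) freq, Long::sum);
    }

    /** Read-only view of the time series, ordered by date. */
    Map<Integer, Long> asMap() {
        return Collections.unmodifiableMap(buckets);
    }

    public static void main(String[] args) {
        TermTimeSeries series = new TermTimeSeries();
        series.add(20140123, 2); // first document dated 2014-01-23
        series.add(20140124, 1);
        series.add(20140123, 3); // second document with the same date
        System.out.println(series.asMap()); // prints {20140123=5, 20140124=1}
    }
}
```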

The approach we have tried is a custom Collector, shown below, but it is very slow.  Is there a different way of approaching this that would perform much better?

@Override
public void collect(int docId) throws IOException {
    try {
      ++collectCount;
      if (reader != null) {
          // Load this document's term vector for the field (one lookup per collected document).
          final Terms terms = reader.getTermVector(docId, field);
          if (terms == null) {
              return; // no term vector stored for this document and field
          }
          termsEnum = terms.iterator(termsEnum);
          final int colIndex = matrix.columns().add(term);
          if (termsEnum.seekExact(termRef)) {
            final DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_FREQS);
            while (docsAndPositionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS){
                // docId is relative to the current segment, as is the dates cache.
                final int date = dates.get(docId);
                final int freq = docsAndPositionsEnum.freq();
                final int rowIndex = matrix.rows().add(date);
                // Add this document's frequency to the (date, term) cell; NaN means the cell is still empty.
                final double value = matrix.getDouble(rowIndex, colIndex);
                matrix.setDouble(rowIndex, colIndex, Double.isNaN(value) ? freq : value + freq);
                if (++docCount % 1000 == 0) {
                  LOG.info("Processed " + docCount + " / " + collectCount + " documents in term frequency analysis...");
                }
            }
          }
      }
    } catch (Throwable t) {
      throw new RuntimeException("Failed to collect document " + docId, t);
    }
}

@Override
public void setNextReader(AtomicReaderContext atomicReaderContext) throws IOException {
    // Called once per segment: keep the segment reader and load its per-document date values.
    this.reader = atomicReaderContext.reader();
    this.dates = FieldCache.DEFAULT.getInts(reader, "date", false);
}

Any help would be much appreciated...

Thanks,
Zav


Re: Building term frequency matrix over 6 million documents...

Posted by Marcio Napoli <na...@gmail.com>.
Hi!

I believe the approach below can help you.
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java
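
One way to read this suggestion: the linked class works from the index's term dictionary rather than from per-document term vectors. Applied to the question, the same idea would be to walk the postings of the one term once and add each document's frequency to its date bucket. Below is a minimal plain-Java sketch of that loop, with arrays standing in for the postings and the date cache. Every name in it is hypothetical and none of it is a Lucene API; in Lucene 4.x the stand-ins would roughly correspond to a per-segment DocsEnum requested with frequencies and the FieldCache ints from the original post.

```java
import java.util.TreeMap;

/** Plain-Java model (no Lucene dependency) of one pass over a single term's postings. */
final class PostingsWalk {
    /**
     * @param postingDocIds ids of the documents containing the term (stand-in for a postings enum)
     * @param postingFreqs  frequency of the term in each of those documents
     * @param dateByDocId   date of every document in the segment, indexed by doc id
     * @return date -> summed term frequency, ordered by date
     */
    static TreeMap<Integer, Long> sumByDate(int[] postingDocIds, int[] postingFreqs, int[] dateByDocId) {
        if (postingDocIds.length != postingFreqs.length) {
            throw new IllegalArgumentException("doc ids and freqs must have the same length");
        }
        TreeMap<Integer, Long> sums = new TreeMap<>();
        // Only documents that actually contain the term are visited.
        for (int i = 0; i < postingDocIds.length; i++) {
            int date = dateByDocId[postingDocIds[i]];
            sums.merge(date, (long) postingFreqs[i], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Five documents in the segment; dates assumed to be yyyyMMdd integers.
        int[] dateByDocId = {20140123, 20140123, 20140124, 20140124, 20140125};
        // The term occurs in docs 0, 2 and 3 with these frequencies.
        int[] docs = {0, 2, 3};
        int[] freqs = {4, 1, 2};
        System.out.println(sumByDate(docs, freqs, dateByDocId)); // prints {20140123=4, 20140124=3}
    }
}
```

The work in this sketch is proportional to the number of documents that contain the term, with no per-document term vector lookup. Whether that removes the bottleneck in a real index is something only a measurement can show.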

Marcio