Posted to dev@lucene.apache.org by Gregor Heinrich <gr...@arbylon.net> on 2010/11/19 08:49:35 UTC

lucene.index.*: extending Lucene to store topic model data ?

Dear list -- a question on potential storage of data originating from "topic 
models" like LSA (latent semantic analysis) and LDA (latent Dirichlet 
allocation). Packages like Mahout or SemanticVectors allow extraction of latent 
topics from an existing Lucene corpus, but they do not integrate storage of the 
actual latent concepts into Lucene's efficient backend. Storing those data 
within Lucene's segments may therefore benefit them.

My question: In the IndexWriter backend, is there any reasonable way you can 
think of to store extra information after segments have been created but before 
a commit() ? (This way any IndexSearcher/Reader always sees a consistent index.) 
Further, it should be possible to modify the extra information in the index 
again after the optimize() step.

Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from the 
information in the index and stores topic related data with the segments 
currently active for indexing, but in extra files. The extra files contain 
document-specific topic float vectors as well as segment-global float vectors. 
During commit(), the extra files are merged with the segments (which involves 
some math processing again). At the end of the indexing process, the LDA 
algorithm is rerun, improving the topic model globally, thus again modifying the 
extra files.
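
To make the scenario more concrete, here is a minimal sketch of the "extra 
files" idea in plain Java, with no Lucene dependency. Everything in it is an 
assumption for illustration: the class name TopicSidecar, the file layout (a 
small header followed by one fixed-width float row per document), and the 
merge rule for segment-global vectors (a document-count weighted average). It 
is not an existing Lucene API, and a real LDA merge would likely need more 
than an average.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical per-segment side file for topic vectors; not part of Lucene. */
final class TopicSidecar {

    /**
     * Writes a header (docCount, numTopics) followed by one row of floats per
     * document. Assumes every row has the same length.
     */
    public static void write(Path file, float[][] docTopics) throws IOException {
        int numTopics = docTopics.length == 0 ? 0 : docTopics[0].length;
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(docTopics.length);
            out.writeInt(numTopics);
            for (float[] row : docTopics) {
                for (float v : row) {
                    out.writeFloat(v);
                }
            }
        }
    }

    /** Reads the topic vector of one document by its segment-local id. */
    public static float[] read(Path file, int docId) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            int docCount = in.readInt();
            int numTopics = in.readInt();
            if (docId < 0 || docId >= docCount) {
                throw new IndexOutOfBoundsException("docId " + docId);
            }
            // Rows are fixed width, so the offset follows from the doc id.
            in.seek(8L + (long) docId * numTopics * Float.BYTES);
            float[] row = new float[numTopics];
            for (int k = 0; k < numTopics; k++) {
                row[k] = in.readFloat();
            }
            return row;
        }
    }

    /**
     * Merges two segment-global vectors as a document-count weighted average.
     * This rule is an assumption standing in for the "math processing" step.
     */
    public static float[] mergeGlobal(float[] a, int docsA, float[] b, int docsB) {
        float total = docsA + docsB;
        float[] merged = new float[a.length];
        for (int k = 0; k < a.length; k++) {
            merged[k] = (a[k] * docsA + b[k] * docsB) / total;
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment0", ".topics");
        float[][] docTopics = { {0.7f, 0.2f, 0.1f}, {0.1f, 0.1f, 0.8f} };
        write(file, docTopics);
        System.out.println(Arrays.toString(read(file, 1)));
        System.out.println(Arrays.toString(
                mergeGlobal(new float[] {1f, 0f}, 1, new float[] {0f, 1f}, 3)));
        Files.delete(file);
    }
}
```

Fixed-width rows keep lookup by document id a single seek, which is roughly 
the access pattern a TermVector-like storage would need.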

What may be a point of departure? Adding a modified TermVector-like storage 
approach and hooking it to extended Segment* classes?

Best regards

gregor



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: lucene.index.*: extending Lucene to store topic model data ?

Posted by Gregor Heinrich <gr...@arbylon.net>.
Hi Uwe -- thanks for this great hint. Is it considered stable enough to handle 
corpora of around 100MB of raw text?

ps -- sorry for staying cryptic about the actual application. I tried to 
abstract its relation to Lucene... Basically it's about automatically 
associating queries and documents with groups of related terms (topics) and thus 
improving recall. I wrote an introductory note about this stuff that may give an 
overview and cites much of the original literature: 
http://www.arbylon.net/publications/text-est2.pdf .
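
As a rough illustration of the matching idea, the sketch below compares a 
query and documents by their topic vectors rather than by shared terms. It 
assumes the topic vectors have already been inferred elsewhere (e.g. by LDA); 
the class name TopicMatch and the choice of cosine similarity are assumptions 
for illustration only, not something prescribed by the note linked above.

```java
/** Hypothetical helper for comparing topic vectors; not part of Lucene. */
final class TopicMatch {

    /** Cosine similarity of two topic vectors of equal length. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += (double) a[k] * b[k];
            normA += (double) a[k] * a[k];
            normB += (double) b[k] * b[k];
        }
        // An all-zero vector has no direction; treat it as dissimilar.
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the index of the most similar document, or -1 if there are none. */
    public static int bestMatch(float[] query, float[][] docs) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int d = 0; d < docs.length; d++) {
            double score = cosine(query, docs[d]);
            if (score > bestScore) {
                bestScore = score;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] query = {0.9f, 0.1f};
        float[][] docs = { {0.1f, 0.9f}, {0.8f, 0.2f} };
        System.out.println("best match: doc " + bestMatch(query, docs));
    }
}
```

A document can score well here without sharing a single term with the query, 
which is where the gain in recall would come from.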

All the best

gregor


On 11/19/10 9:07 AM, Uwe Schindler wrote:
> Hi Gregor,
>
> I do not come from your area, so I don't understand all the stuff you are
> writing about, but from what you write, it looks like you are interested in
> the new flexible indexing coming with Lucene 4.0, aka Lucene trunk? Currently,
> flexible indexing only allows modifying the term dictionary and posting lists
> (the 4-dim Enum API in Lucene), but in the future we will also allow
> modifying the index format of stored fields/term vectors. We have already
> started on patches that allow per-field/document statistics for BM25
> scoring.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>



RE: lucene.index.*: extending Lucene to store topic model data ?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Gregor,

I do not come from your area, so I don't understand all the stuff you are
writing about, but from what you write, it looks like you are interested in
the new flexible indexing coming with Lucene 4.0, aka Lucene trunk? Currently,
flexible indexing only allows modifying the term dictionary and posting lists
(the 4-dim Enum API in Lucene), but in the future we will also allow
modifying the index format of stored fields/term vectors. We have already
started on patches that allow per-field/document statistics for BM25
scoring.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



