Posted to dev@lucene.apache.org by "Vishmi Money (JIRA)" <ji...@apache.org> on 2014/02/28 19:30:22 UTC

[jira] [Commented] (LUCENE-5422) Postings lists deduplication

    [ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916140#comment-13916140 ] 

Vishmi Money commented on LUCENE-5422:
--------------------------------------

Hi,
I am Vishmi Money, a third-year undergraduate at the Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka.

I am familiar with Lucene, having studied it for a project in which I tried to implement Global Search for Moodle. However, I found that Lucene was a dead end for that project because Moodle is implemented in PHP.

After going through the discussion you provided, I am very interested in working on this project for GSoC 2014, because I am also very interested in data structures and algorithms.

Could you explain the relationship between LUCENE-2082 and LUCENE-5422 in more detail, so that I can start working on this project?



> Postings lists deduplication
> ----------------------------
>
>                 Key: LUCENE-5422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/index
>            Reporter: Dmitry Kan
>              Labels: gsoc2014
>
> The context:
> http://markmail.org/thread/tywtrjjcfdbzww6f
> Robert Muir and I discussed what Robert eventually named "postings
> lists deduplication" at the Berlin Buzzwords 2013 conference.
> The idea is to allow multiple terms to point to the same postings list to
> save space. This can be achieved by a new index codec implementation, but
> this JIRA issue is open to other ideas as well.
> The application / impact of this is positive for synonyms, exact / inexact
> terms, leading wildcard support via storing reversed terms, etc.
> For example, at the moment, when supporting exact (unstemmed) and inexact (stemmed)
> searches, we store both the unstemmed and stemmed variants of a word form,
> which bloats the index. The same index size considerations are why we had to
> remove leading wildcard support, which relied on reversing each token at
> index and query time.
> Comment from Mike McCandless:
> Neat idea!
> Would this idea allow a single term to point to (the union of) N other
> posting lists?  It seems like that's necessary e.g. to handle the
> exact/inexact case.
> And then, to produce the Docs/AndPositionsEnum you'd need to do the
> merge sort across those N posting lists?
> Such a thing might also be doable as a runtime-only wrapper around the
> postings API (FieldsProducer), if you could at runtime do the reverse
> expansion (e.g. stem -> all of its surface forms).
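
A minimal sketch of the merge step Mike describes, for illustration only: one term resolves to the union of N sorted postings lists, produced by a heap-based merge. It models postings as plain sorted int arrays of doc IDs and does not use Lucene's postings API; the class name PostingsUnionSketch and the method union are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model: union of N sorted doc ID lists. Not Lucene's postings API. */
class PostingsUnionSketch {

    /** Merges sorted doc ID lists into one sorted list without duplicates. */
    static List<Integer> union(List<int[]> lists) {
        // Queue entries are {listIndex, positionInList}, ordered by the doc ID they point at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> Integer.compare(lists.get(a[0])[a[1]], lists.get(b[0])[b[1]]));
        for (int i = 0; i < lists.size(); i++) {
            if (lists.get(i).length > 0) {
                queue.add(new int[] {i, 0});
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            int doc = lists.get(head[0])[head[1]];
            // The same doc may appear in several lists; emit it only once.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            // Advance within the list the smallest doc ID came from.
            if (head[1] + 1 < lists.get(head[0]).length) {
                queue.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

In the exact/inexact case, the lists passed to union would be the postings of every surface form of a stem. A real implementation would also have to merge positions, which this sketch leaves out.
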
> Comment from Robert Muir:
> I think the exact/inexact case is trickier (detecting it would be the hard
> part), and you are right, another solution might work better.
> But for the reverse wildcard and synonyms situation, it seems we could even
> detect it on write if we created some hash of the previous terms' postings.
> If the hash matches for the current term, we know it might be a "duplicate"
> and would have to actually do the costly check that they are the same.
> Maybe there are better ways to do it, but it might be a fun postings format
> experiment to try.
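
A minimal sketch of the write-time detection Robert outlines, for illustration only. It assumes postings are plain sorted int arrays of doc IDs and that hashes of all previously written postings lists are kept in memory; neither assumption comes from the thread. The class PostingsDedupSketch and its methods are hypothetical and are not part of Lucene's codec API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of write-time postings deduplication. Not Lucene's codec API. */
class PostingsDedupSketch {
    // Stored postings lists; several terms may share one slot.
    private final List<int[]> stored = new ArrayList<>();
    // Hash of a postings list -> slots of the stored lists with that hash.
    private final Map<Integer, List<Integer>> slotsByHash = new HashMap<>();
    // Term -> slot of its postings list.
    private final Map<String, Integer> termToSlot = new HashMap<>();

    /** Adds a term; returns true if it reused an already stored list. */
    boolean addTerm(String term, int[] docIds) {
        int hash = Arrays.hashCode(docIds);
        List<Integer> candidates =
            slotsByHash.computeIfAbsent(hash, h -> new ArrayList<>());
        for (int slot : candidates) {
            // A hash match only means "might be a duplicate",
            // so do the costly full comparison before sharing the list.
            if (Arrays.equals(stored.get(slot), docIds)) {
                termToSlot.put(term, slot);
                return true;
            }
        }
        // No identical list seen before: store a new one.
        stored.add(docIds.clone());
        int newSlot = stored.size() - 1;
        candidates.add(newSlot);
        termToSlot.put(term, newSlot);
        return false;
    }

    /** Returns the (possibly shared) postings list for a term, or null. */
    int[] postings(String term) {
        Integer slot = termToSlot.get(term);
        return slot == null ? null : stored.get(slot);
    }

    int storedListCount() {
        return stored.size();
    }
}
```

With this sketch, adding "lucene" and its reversed form "enecul" with the same doc IDs stores one list and points both terms at it. The hash keeps the common case cheap, and the full comparison guards against hash collisions.
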



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
