Posted to dev@lucene.apache.org by Han Jiang <ji...@gmail.com> on 2012/03/21 16:03:36 UTC

[GSoC]About some general information

Hi All,

I'm Billy, a senior undergraduate student at Peking University. I work
in the area of Information Retrieval and Web Mining. While going through the
idea list, I became quite interested in
LUCENE-3892 <https://issues.apache.org/jira/browse/LUCENE-3892> and
LUCENE-3069 <https://issues.apache.org/jira/browse/LUCENE-3069>. I am very
proficient in Java, and have been using Lucene for about one year. I am
looking forward to making a contribution to this project.

I have a few questions about Lucene:

First of all, which version of Lucene should we use as a starting point:
trunk or 3.5?
Is there any demo code that shows the idea of Codecs?
How many postings formats are supposed to be implemented for project
LUCENE-3892 <https://issues.apache.org/jira/browse/LUCENE-3892>?
Is there any further documentation for
LUCENE-3069 <https://issues.apache.org/jira/browse/LUCENE-3069>?

Thank you!

-- 
Han Jiang

EECS, Peking University, China
Every Effort Creates Smile

Senior Student

Re: [GSoC]About some general information

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hello!  Answers below...:

On Wed, Mar 21, 2012 at 11:03 AM, Han Jiang <ji...@gmail.com> wrote:
> Hi All,
>
> I'm Billy, a senior undergraduate student at Peking University. I work
> in the area of Information Retrieval and Web Mining. While going through the
> idea list, I became quite interested in LUCENE-3892 and LUCENE-3069. I am
> very proficient in Java, and have been using Lucene for about one year. I am
> looking forward to making a contribution to this project.

Awesome.

> I have a few questions about Lucene:
>
> First of all, which version of Lucene should we use as a starting point:
> trunk or 3.5?

Both of these issues will be trunk only I think: they both are far
easier to do with the Codec API in 4.0.

> Is there any demo code that shows the idea of Codecs?

Maybe the simplest demo would be to look at the SimpleText codec?  It
roughly "tries" to have simple source code as well as a simple (text
only, human readable) on-disk format.
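
To give a rough feel for what a "text only, human readable" postings
format means, here is a minimal, self-contained Java sketch. It is NOT
the SimpleText codec and does not reproduce its file layout or use any
Lucene API; the class name ToyTextPostings, the "term"/"doc"/"END"
keywords, and the indentation are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not Lucene's SimpleText codec or its layout.
final class ToyTextPostings {

    /** Renders term -> doc IDs as indented, human-readable text. */
    static String write(Map<String, List<Integer>> postings) {
        StringBuilder out = new StringBuilder();
        // Copying into a TreeMap makes the terms come out in sorted order.
        for (Map.Entry<String, List<Integer>> e : new TreeMap<>(postings).entrySet()) {
            out.append("term ").append(e.getKey()).append('\n');
            for (int doc : e.getValue()) {
                out.append("  doc ").append(doc).append('\n');
            }
        }
        out.append("END\n"); // made-up end-of-file marker
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("lucene", Arrays.asList(0, 2));
        postings.put("codec", Arrays.asList(1));
        System.out.print(write(postings));
    }
}
```

The point of such a format is that the "index" can be opened in a text
editor and checked by eye, which is handy when learning or debugging, at
the cost of size and speed.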

> How many posting formats are supposed to be implemented, for project
> LUCENE-3892 ?

This can be worked out when scoping the project... but I think getting
one postings format working well would be awesome :)  If somehow
that's too easy, then add more!

> Is there any further documentation for LUCENE-3069 ?

Not that I know of... but I suspect the approach can be very similar
to the MemoryPostingsFormat we already have, just that it'd only be
the terms data stored in the FST, while the postings
(docs/freqs/positions/offsets) are written to a file.

Ideally, it would just act like a different terms dictionary
implementation, ie so that we can then plug in any PostingsBaseFormat
(even the one from LUCENE-3892!).
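
The split described above (terms kept in memory, postings written
elsewhere, postings encoding pluggable) can be sketched in plain Java. This
is only a toy under loud assumptions: a TreeMap stands in for the FST, a
ByteArrayOutputStream stands in for the postings file, and the names
ToyPostingsFormat, FixedIntPostings, and ToyTermsDict are hypothetical.
None of this is Lucene's actual PostingsBaseFormat or terms dictionary API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.TreeMap;

/** Hypothetical pluggable postings encoding; not a Lucene interface. */
interface ToyPostingsFormat {
    byte[] encode(int[] docIds);
    int[] decode(byte[] data, int offset, int length);
}

/** Simplest possible encoding: each doc ID as a fixed 4-byte int. */
final class FixedIntPostings implements ToyPostingsFormat {
    public byte[] encode(int[] docIds) {
        ByteBuffer buf = ByteBuffer.allocate(4 * docIds.length);
        for (int d : docIds) buf.putInt(d);
        return buf.array();
    }

    public int[] decode(byte[] data, int offset, int length) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, length);
        int[] docs = new int[length / 4];
        for (int i = 0; i < docs.length; i++) docs[i] = buf.getInt();
        return docs;
    }
}

/** Terms live in memory; postings live in a separate byte stream. */
final class ToyTermsDict {
    private final ToyPostingsFormat format;
    // term -> {offset, length} into the postings bytes (TreeMap stands in for the FST)
    private final TreeMap<String, int[]> terms = new TreeMap<>();
    // stands in for the on-disk postings file
    private final ByteArrayOutputStream postingsFile = new ByteArrayOutputStream();

    ToyTermsDict(ToyPostingsFormat format) {
        this.format = format;
    }

    void add(String term, int[] docIds) {
        byte[] bytes = format.encode(docIds);
        terms.put(term, new int[] {postingsFile.size(), bytes.length});
        postingsFile.write(bytes, 0, bytes.length);
    }

    int[] lookup(String term) {
        int[] ptr = terms.get(term);
        if (ptr == null) return new int[0]; // unknown term: no postings
        return format.decode(postingsFile.toByteArray(), ptr[0], ptr[1]);
    }
}
```

Usage would look like `new ToyTermsDict(new FixedIntPostings())`, followed
by `add("lucene", new int[] {0, 2, 5})` and `lookup("lucene")`. Swapping in
a different ToyPostingsFormat changes how postings are stored without
touching the terms dictionary, which is the kind of separation described
above.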

> Thank you!

You're welcome, and welcome to Lucene/Solr!

Mike McCandless

http://blog.mikemccandless.com
