Posted to dev@lucene.apache.org by Varun Thacker <va...@gmail.com> on 2011/04/05 19:56:59 UTC

My GSOC proposal

Hi,

I'm Varun Thacker, a Computer Science student from Manipal Institute
of Technology, India. I am interested in contributing to the Lucene
project as part of GSoC 2011.

I would like to combine two tasks as part of my project: "Directory
createOutput and openInput should take an IOContext" (LUCENE-2793),
complemented by "Generalize DirectIOLinuxDir to UnixDir"
(LUCENE-2795).

The first part of the project is aimed at significantly reducing the
time taken to search during indexing by adding an IOContext, which
would store the buffer size and carry options to bypass the OS's buffer
cache (this is what causes the slowdown in search) along with other
hints. Once that is completed I would move on to LUCENE-2795 and
generalize the Directory implementation into a UnixDirectory.
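
To make the idea concrete, here is a minimal Java sketch of what such
an IOContext might carry. The class name IOContextSketch, the Purpose
values, the field names and the buffer sizes are all assumptions made
for illustration; nothing here is an existing Lucene API, and the
actual design is still open.

```java
import java.util.Objects;

// Hypothetical sketch: these names are illustrative, not an existing Lucene API.
final class IOContextSketch {

    /** High-level reason a file is being opened. */
    enum Purpose { MERGE, FLUSH, SEARCH }

    final Purpose purpose;
    final int bufferSize;          // read/write buffer size in bytes
    final boolean bypassOsCache;   // hint: avoid filling the OS buffer cache

    IOContextSketch(Purpose purpose, int bufferSize, boolean bypassOsCache) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.purpose = Objects.requireNonNull(purpose, "purpose");
        this.bufferSize = bufferSize;
        this.bypassOsCache = bypassOsCache;
    }

    /** Large sequential I/O that should not evict pages that are hot for search. */
    static IOContextSketch forMerge() {
        return new IOContextSketch(Purpose.MERGE, 1 << 20, true);
    }

    /** Ordinary search-time reads that benefit from the OS cache. */
    static IOContextSketch forSearch() {
        return new IOContextSketch(Purpose.SEARCH, 4096, false);
    }

    @Override
    public String toString() {
        return purpose + " bufferSize=" + bufferSize + " bypassOsCache=" + bypassOsCache;
    }

    public static void main(String[] args) {
        System.out.println(forMerge());
        System.out.println(forSearch());
    }
}
```

A Directory implementation would receive one of these objects when a
file is opened and decide for itself how to honor the hints.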

I am an active member of our college's Linux Users Group
(http://lugmanipal.org/) and have actively participated in FOSS
activities in India, attending PyCon India 2009 and FOSS.IN in 2010.
In December 2010 I helped a professor at the Indian Institute of
Technology, Delhi with the coding for his research paper on the
Quadratic Assignment Problem.

I have spoken to Michael McCandless and Simon Willnauer about
undertaking these tasks. Michael McCandless has agreed to mentor me.
I would love to contribute to and learn from the Apache Lucene
community this summer. I would also welcome suggestions on how to make
my proposal stronger.


-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: My GSOC proposal

Posted by Varun Thacker <va...@gmail.com>.

I have refined my proposal here: http://goo.gl/uYXrV

Are there any suggestions I should incorporate into my proposal before
today's deadline?



-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com

Re: My GSOC proposal

Posted by Varun Thacker <va...@gmail.com>.

I have updated my proposal online to mention the time I would be able to
dedicate to the project.



-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com

Re: My GSOC proposal

Posted by Adriano Crestani <ad...@gmail.com>.

Hi Varun,

Nice proposal, very complete. Only one thing is missing: you should mention
somewhere how many hours a week you are willing to spend working on the
project and whether there are any holidays during which you won't be able
to work.

Good luck ;)


Re: My GSOC proposal

Posted by Varun Thacker <va...@gmail.com>.

I have drafted the proposal on the official GSoC website. This is the link
to my proposal: http://goo.gl/uYXrV
Please let me know if anything needs to be changed, added, or removed.

I will keep on working on it till the deadline on the 8th.

On Wed, Apr 6, 2011 at 11:41 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> That test code looks good -- you really should have seen awful
> performance had you used O_DIRECT since you read byte by byte.
>
> A more realistic test is to read a whole buffer (eg 4 KB is what
> Lucene now uses during merging, but we'd probably up this to like 1 MB
> when using O_DIRECT).
>
> Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and
> for good reason: its existence means projects like ours can use it to
> "work around" limitations in the Linux IO apis that control the buffer
> cache when, otherwise, we might conceivably make patches to fix Linux
> correctly.  It's an escape hatch, and we all use the escape hatch
> instead of trying to fix Linux for real...
>
> For example the NOREUSE flag is a no-op now in Linux, which is a
> shame, because that's precisely the flag we'd want to use for merging
> (along with SEQUENTIAL).  Had that flag been implemented well, it'd
> give better results than our workaround using O_DIRECT.
>
> Anyway, giving how things are, until we can get more control (waaaay
> up in Javaland) over the buffer cache, O_DIRECT (via native directory
> impl through JNI) is our only real option, today.
>
> More details here:
> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html
>
> Note that other OSs likely do a better job and actually implement
> NOREUSE, and similar APIs, so the generic Unix/WindowsNativeDirectory
> would simply use NOREUSE on these platforms for I/O during segment
> merging.
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker
> <va...@gmail.com> wrote:
> > Hi. I wrote some sample code to test the speed difference between SEQUENTIAL
> > and O_DIRECT (I used the madvise flag MADV_DONTNEED) reads.
> >
> > This is the link to the code: http://pastebin.com/8QywKGyS
> >
> > There was a speed difference when I switched between the two flags. I
> > have not used the O_DIRECT flag because Linus had criticized it.
> >
> > Is this what the flags are intended to be used for? This is just
> > sample code with a test file.
> >
> > On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer
> > <si...@googlemail.com> wrote:
> >> Hey Varun,
> >> On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
> >> <lu...@mikemccandless.com> wrote:
> >>> Hi Varun,
> >>>
> >>> Those two issues would make a great GSoC!  Comments below...
> >> +1
> >>>
> >>> On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
> >>> <va...@gmail.com> wrote:
> >>>
> >>>> I would like to combine two tasks as part of my project
> >>>> namely-Directory createOutput and openInput should take an IOContext
> >>>> (Lucene-2793) and complement it by Generalize DirectIOLinuxDir to
> >>>> UnixDir (Lucene-2795).
> >>>>
> >>>> The first part of the project is aimed at significantly reducing time
> >>>> taken to search during indexing by adding an IOContext which would
> >>>> store buffer size and have options to bypass the OS’s buffer cache
> >>>> (This is what causes the slowdown in search ) and other hints. Once
> >>>> completed I would move on to Lucene-2795 and generalize the Directory
> >>>> implementation to make a UnixDirectory .
> >>>
> >>> So, the first part (LUCENE-2793) should cause no change at all to
> >>> performance, functionality, etc., because it's "merely" installing the
> >>> plumbing (IOContext threaded throughout the low-level store APIs in
> >>> Lucene) so that higher levels can send important details down to the
> >>> Directory.  We'd fix IndexWriter/IndexReader to fill out this
> >>> IOContext with the details (merging, flushing, new reader, etc.).
> >>>
> >>> There's some fun/freedom here in figuring out just what details should
> >>> be included in IOContext... (eg: is it low level "set buffer size to 4 KB"
> >>> or is it high level "I am opening a new near-real-time reader").
> >>>
> >>> This first step is a rote cutover, just changing APIs but in no way
> >>> taking advantage of the new APIs.
> >>>
> >>> The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
> >>> by creating a UnixDir impl that, using JNI (C code), passes advanced
> >>> flags when opening files, based on the incoming IOContext.
> >>>
> >>> The goal is a single UnixDir that has ifdefs so that it's usable
> >>> across multiple Unices, and eg would use direct IO if the context is
> >>> merging.  If we are ambitious we could rope Windows into the mix, too,
> >>> and then this would be NativeDir...
> >>>
> >>> We can measure success by validating that a big merge while searching
> >>> does not hurt search performance?  (Ie we should be able to reproduce
> >>> the results from
> >>> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).
> >>
> >> Thanks for the summary mike!
> >>>
> >>>> I have spoken to Michael McCandless and Simon Willnauer about
> >>>> undertaking these tasks. Michael McCandless has agreed to mentor me.
> >>>> I would love to be able to contribute and learn from Apache Lucene
> >>>> community this summer. Also I would love suggestions on how to make my
> >>>> application proposal stronger.
> >>>
> >>> I think either Simon or I can be the "official" mentor, and then the
> >>> other one of us (and other Lucene committers) will support/chime
> >>> in...
> >>
> >> I will take the official responsibility here once we are there!
> >> simon
> >>>
> >>> This is an important change for Lucene!
> >>>
> >>> Mike
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>
> >>>
> >>
> >
> >
> >
> > --
> >
> >
> > Regards,
> > Varun Thacker
> > http://varunthacker.wordpress.com
> >
> >
> >
> >
>



-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com

Re: My GSOC proposal

Posted by Michael McCandless <lu...@mikemccandless.com>.

That test code looks good -- you really would have seen awful
performance had you used O_DIRECT, since you read byte by byte.

A more realistic test is to read a whole buffer (e.g. 4 KB is what
Lucene now uses during merging, but we'd probably raise this to
something like 1 MB when using O_DIRECT).
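
A rough Java sketch of that difference, comparing one read() call per
byte with reads into a 4 KB buffer. It goes through the normal buffer
cache and does not use O_DIRECT, so it only illustrates the effect of
call granularity; the class and method names are made up for this
example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, written only for this illustration.
final class ReadGranularity {

    /** Reads the file one byte per read() call on an unbuffered stream. */
    static long readByteByByte(Path file) {
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (in.read() != -1) {
                total++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Reads the file in chunks of bufferSize bytes. */
    static long readInChunks(Path file, int bufferSize) {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Creates a temporary file filled with `size` zero bytes. */
    static Path writeTempFile(int size) {
        try {
            Path tmp = Files.createTempFile("read-granularity", ".bin");
            Files.write(tmp, new byte[size]);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = writeTempFile(1 << 20);
        try {
            long t0 = System.nanoTime();
            long a = readByteByByte(tmp);
            long t1 = System.nanoTime();
            long b = readInChunks(tmp, 4096);
            long t2 = System.nanoTime();
            System.out.println("byte-by-byte: " + a + " bytes in " + (t1 - t0) / 1000 + " us");
            System.out.println("4 KB chunks:  " + b + " bytes in " + (t2 - t1) / 1000 + " us");
        } finally {
            tmp.toFile().delete();
        }
    }
}
```

Timings from a single run like this are noisy, so treat them as a
rough indication only.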

Linus does hate O_DIRECT (see http://kerneltrap.org/node/7563), and
for good reason: its existence means projects like ours can use it to
"work around" limitations in the Linux I/O APIs that control the buffer
cache when, otherwise, we might conceivably make patches to fix Linux
correctly.  It's an escape hatch, and we all use the escape hatch
instead of trying to fix Linux for real...

For example the NOREUSE flag is a no-op now in Linux, which is a
shame, because that's precisely the flag we'd want to use for merging
(along with SEQUENTIAL).  Had that flag been implemented well, it'd
give better results than our workaround using O_DIRECT.

Anyway, given how things are, until we can get more control (waaaay
up in Javaland) over the buffer cache, O_DIRECT (via native directory
impl through JNI) is our only real option, today.

More details here:
http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html

Note that other OSs likely do a better job and actually implement
NOREUSE, and similar APIs, so the generic Unix/WindowsNativeDirectory
would simply use NOREUSE on these platforms for I/O during segment
merging.
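
The selection logic described here could look roughly like the
following Java sketch. Whether a given platform really honors NOREUSE
has to be established separately (for instance in the native layer),
so it is passed in as a flag; the class name, the Hint values and the
choose method are hypothetical.

```java
// Hypothetical sketch of the per-platform choice; not an existing Lucene API.
final class MergeIoStrategy {

    /** How the native layer might open or advise a file. */
    enum Hint { DEFAULT, NOREUSE_SEQUENTIAL, DIRECT_IO }

    /**
     * @param isMerge            true when the file is opened for segment merging
     * @param noReuseImplemented true when the platform is known to honor NOREUSE
     */
    static Hint choose(boolean isMerge, boolean noReuseImplemented) {
        if (!isMerge) {
            return Hint.DEFAULT;             // searches keep using the OS cache
        }
        return noReuseImplemented
                ? Hint.NOREUSE_SEQUENTIAL    // preferred when the OS honors it
                : Hint.DIRECT_IO;            // workaround where NOREUSE is a no-op
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));
        System.out.println(choose(true, true));
        System.out.println(choose(false, true));
    }
}
```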

Mike

http://blog.mikemccandless.com

On Wed, Apr 6, 2011 at 11:56 AM, Varun Thacker
<va...@gmail.com> wrote:
> Hi. I wrote some sample code to test the speed difference between SEQUENTIAL
> and O_DIRECT (I used the madvise flag MADV_DONTNEED) reads.
>
> This is the link to the code: http://pastebin.com/8QywKGyS
>
> There was a speed difference when I switched between the two flags. I
> have not used the O_DIRECT flag because Linus had criticized it.
>
> Is this what the flags are intended to be used for? This is just sample
> code with a test file.
>
> On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer
> <si...@googlemail.com> wrote:
>> Hey Varun,
>> On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>>> Hi Varun,
>>>
>>> Those two issues would make a great GSoC!  Comments below...
>> +1
>>>
>>> On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
>>> <va...@gmail.com> wrote:
>>>
>>>> I would like to combine two tasks as part of my project
>>>> namely-Directory createOutput and openInput should take an IOContext
>>>> (Lucene-2793) and complement it with Generalize DirectIOLinuxDir to
>>>> UnixDir (Lucene-2795).
>>>>
>>>> The first part of the project is aimed at significantly reducing time
>>>> taken to search during indexing by adding an IOContext which would
>>>> store buffer size and have options to bypass the OS’s buffer cache
>>>> (This is what causes the slowdown in search ) and other hints. Once
>>>> completed I would move on to Lucene-2795 and generalize the Directory
>>>> implementation to make a UnixDirectory .
>>>
>>> So, the first part (LUCENE-2793) should cause no change at all to
>>> performance, functionality, etc., because it's "merely" installing the
>>> plumbing (IOContext threaded throughout the low-level store APIs in
>>> Lucene) so that higher levels can send important details down to the
>>> Directory.  We'd fix IndexWriter/IndexReader to fill out this
>>> IOContext with the details (merging, flushing, new reader, etc.).
>>>
>>> There's some fun/freedom here in figuring out just what details should
>>> be included in IOContext... (eg: is it low level "set buffer size to 4 KB"
>>> or is it high level "I am opening a new near-real-time reader").
>>>
>>> This first step is a rote cutover, just changing APIs but in no way
>>> taking advantage of the new APIs.
>>>
>>> The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
>>> by creating a UnixDir impl that, using JNI (C code), passes advanced
>>> flags when opening files, based on the incoming IOContext.
>>>
>>> The goal is a single UnixDir that has ifdefs so that it's usable
>>> across multiple Unices, and eg would use direct IO if the context is
>>> merging.  If we are ambitious we could rope Windows into the mix, too,
>>> and then this would be NativeDir...
>>>
>>> We can measure success by validating that a big merge while searching
>>> does not hurt search performance?  (Ie we should be able to reproduce
>>> the results from
>>> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).
>>
>> Thanks for the summary mike!
>>>
>>>> I have spoken to Michael McCandless and Simon Willnauer about
>>>> undertaking these tasks. Michael McCandless has agreed to mentor me.
>>>> I would love to be able to contribute and learn from Apache Lucene
>>>> community this summer. Also I would love suggestions on how to make my
>>>> application proposal stronger.
>>>
>>> I think either Simon or I can be the "official" mentor, and then the
>>> other one of us (and other Lucene committers) will support/chime
>>> in...
>>
>> I will take the official responsibility here once we are there!
>> simon
>>>
>>> This is an important change for Lucene!
>>>
>>> Mike
>>>
>>
>
>
>
> --
>
>
> Regards,
> Varun Thacker
> http://varunthacker.wordpress.com
>
>
>
>

Re: My GSOC proposal

Posted by Varun Thacker <va...@gmail.com>.

Hi. I wrote some sample code to test the speed difference between SEQUENTIAL
and O_DIRECT (I used the madvise flag MADV_DONTNEED) reads.

This is the link to the code: http://pastebin.com/8QywKGyS

There was a speed difference when I switched between the two flags. I
have not used the O_DIRECT flag because Linus had criticized it.

Is this what the flags are intended to be used for? This is just sample
code with a test file.

On Wed, Apr 6, 2011 at 12:11 PM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:
> Hey Varun,
> On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> Hi Varun,
>>
>> Those two issues would make a great GSoC!  Comments below...
> +1
>>
>> On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
>> <va...@gmail.com> wrote:
>>
>>> I would like to combine two tasks as part of my project
>>> namely-Directory createOutput and openInput should take an IOContext
>>> (Lucene-2793) and complement it with Generalize DirectIOLinuxDir to
>>> UnixDir (Lucene-2795).
>>>
>>> The first part of the project is aimed at significantly reducing time
>>> taken to search during indexing by adding an IOContext which would
>>> store buffer size and have options to bypass the OS’s buffer cache
>>> (This is what causes the slowdown in search ) and other hints. Once
>>> completed I would move on to Lucene-2795 and generalize the Directory
>>> implementation to make a UnixDirectory .
>>
>> So, the first part (LUCENE-2793) should cause no change at all to
>> performance, functionality, etc., because it's "merely" installing the
>> plumbing (IOContext threaded throughout the low-level store APIs in
>> Lucene) so that higher levels can send important details down to the
>> Directory.  We'd fix IndexWriter/IndexReader to fill out this
>> IOContext with the details (merging, flushing, new reader, etc.).
>>
>> There's some fun/freedom here in figuring out just what details should
>> be included in IOContext... (eg: is it low level "set buffer size to 4 KB"
>> or is it high level "I am opening a new near-real-time reader").
>>
>> This first step is a rote cutover, just changing APIs but in no way
>> taking advantage of the new APIs.
>>
>> The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
>> by creating a UnixDir impl that, using JNI (C code), passes advanced
>> flags when opening files, based on the incoming IOContext.
>>
>> The goal is a single UnixDir that has ifdefs so that it's usable
>> across multiple Unices, and eg would use direct IO if the context is
>> merging.  If we are ambitious we could rope Windows into the mix, too,
>> and then this would be NativeDir...
>>
>> We can measure success by validating that a big merge while searching
>> does not hurt search performance?  (Ie we should be able to reproduce
>> the results from
>> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).
>
> Thanks for the summary mike!
>>
>>> I have spoken to Michael McCandless and Simon Willnauer about
>>> undertaking these tasks. Michael McCandless has agreed to mentor me.
>>> I would love to be able to contribute and learn from Apache Lucene
>>> community this summer. Also I would love suggestions on how to make my
>>> application proposal stronger.
>>
>> I think either Simon or I can be the "official" mentor, and then the
>> other one of us (and other Lucene committers) will support/chime
>> in...
>
> I will take the official responsibility here once we are there!
> simon
>>
>> This is an important change for Lucene!
>>
>> Mike
>>
>



-- 


Regards,
Varun Thacker
http://varunthacker.wordpress.com

Re: My GSOC proposal

Posted by Simon Willnauer <si...@googlemail.com>.

Hey Varun,
On Tue, Apr 5, 2011 at 11:07 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Hi Varun,
>
> Those two issues would make a great GSoC!  Comments below...
+1
>
> On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
> <va...@gmail.com> wrote:
>
>> I would like to combine two tasks as part of my project
>> namely-Directory createOutput and openInput should take an IOContext
>> (Lucene-2793) and complement it with Generalize DirectIOLinuxDir to
>> UnixDir (Lucene-2795).
>>
>> The first part of the project is aimed at significantly reducing time
>> taken to search during indexing by adding an IOContext which would
>> store buffer size and have options to bypass the OS’s buffer cache
>> (This is what causes the slowdown in search ) and other hints. Once
>> completed I would move on to Lucene-2795 and generalize the Directory
>> implementation to make a UnixDirectory .
>
> So, the first part (LUCENE-2793) should cause no change at all to
> performance, functionality, etc., because it's "merely" installing the
> plumbing (IOContext threaded throughout the low-level store APIs in
> Lucene) so that higher levels can send important details down to the
> Directory.  We'd fix IndexWriter/IndexReader to fill out this
> IOContext with the details (merging, flushing, new reader, etc.).
>
> There's some fun/freedom here in figuring out just what details should
> be included in IOContext... (eg: is it low level "set buffer size to 4 KB"
> or is it high level "I am opening a new near-real-time reader").
>
> This first step is a rote cutover, just changing APIs but in no way
> taking advantage of the new APIs.
>
> The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
> by creating a UnixDir impl that, using JNI (C code), passes advanced
> flags when opening files, based on the incoming IOContext.
>
> The goal is a single UnixDir that has ifdefs so that it's usable
> across multiple Unices, and eg would use direct IO if the context is
> merging.  If we are ambitious we could rope Windows into the mix, too,
> and then this would be NativeDir...
>
> We can measure success by validating that a big merge while searching
> does not hurt search performance?  (Ie we should be able to reproduce
> the results from
> http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).

Thanks for the summary mike!
>
>> I have spoken to Michael McCandless and Simon Willnauer about
>> undertaking these tasks. Michael McCandless has agreed to mentor me.
>> I would love to be able to contribute and learn from Apache Lucene
>> community this summer. Also I would love suggestions on how to make my
>> application proposal stronger.
>
> I think either Simon or I can be the "official" mentor, and then the
> other one of us (and other Lucene committers) will support/chime
> in...

I will take the official responsibility here once we are there!
simon
>
> This is an important change for Lucene!
>
> Mike
>

Re: My GSOC proposal

Posted by Michael McCandless <lu...@mikemccandless.com>.

Hi Varun,

Those two issues would make a great GSoC!  Comments below...

On Tue, Apr 5, 2011 at 1:56 PM, Varun Thacker
<va...@gmail.com> wrote:

> I would like to combine two tasks as part of my project
> namely-Directory createOutput and openInput should take an IOContext
> (Lucene-2793) and complement it with Generalize DirectIOLinuxDir to
> UnixDir (Lucene-2795).
>
> The first part of the project is aimed at significantly reducing time
> taken to search during indexing by adding an IOContext which would
> store buffer size and have options to bypass the OS’s buffer cache
> (This is what causes the slowdown in search ) and other hints. Once
> completed I would move on to Lucene-2795 and generalize the Directory
> implementation to make a UnixDirectory .

So, the first part (LUCENE-2793) should cause no change at all to
performance, functionality, etc., because it's "merely" installing the
plumbing (IOContext threaded throughout the low-level store APIs in
Lucene) so that higher levels can send important details down to the
Directory.  We'd fix IndexWriter/IndexReader to fill out this
IOContext with the details (merging, flushing, new reader, etc.).

There's some fun/freedom here in figuring out just what details should
be included in IOContext... (eg: is it low level "set buffer size to 4 KB"
or is it high level "I am opening a new near-real-time reader").
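
One way to picture the low-level vs. high-level question is a context that
carries both: a high-level purpose plus an optional buffer-size override.
This is only a sketch; IOPurpose, IOContextSketch and the default sizes are
invented for illustration and are not the API this issue would actually add.

```java
import java.util.OptionalInt;

/** Hypothetical high-level reasons for opening a file (not Lucene's actual API). */
enum IOPurpose { MERGE, FLUSH, NEW_READER, NRT_READER, DEFAULT }

/** Hypothetical context handed down to Directory.createOutput/openInput. */
final class IOContextSketch {
    final IOPurpose purpose;
    /** Low-level override; empty when the caller has no opinion. */
    final OptionalInt bufferSizeHint;

    private IOContextSketch(IOPurpose purpose, OptionalInt bufferSizeHint) {
        this.purpose = purpose;
        this.bufferSizeHint = bufferSizeHint;
    }

    /** High-level only: "I am merging", "I am opening an NRT reader", ... */
    static IOContextSketch of(IOPurpose purpose) {
        return new IOContextSketch(purpose, OptionalInt.empty());
    }

    /** High-level purpose plus an explicit low-level buffer size in bytes. */
    static IOContextSketch withBufferSize(IOPurpose purpose, int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("buffer size must be positive: " + bytes);
        }
        return new IOContextSketch(purpose, OptionalInt.of(bytes));
    }

    /** An explicit hint wins; otherwise use a per-purpose default (illustrative values). */
    int effectiveBufferSize() {
        if (bufferSizeHint.isPresent()) {
            return bufferSizeHint.getAsInt();
        }
        return purpose == IOPurpose.MERGE ? 4096 : 1024;
    }
}
```

With this shape, IndexWriter would only say IOContextSketch.of(IOPurpose.MERGE)
and the Directory would decide what that means, while a caller that really
wants 4 KB can still ask for it explicitly.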

This first step is a rote cutover, just changing APIs but in no way
taking advantage of the new APIs.

The 2nd step (LUCENE-2795) would then take advantage of this plumbing,
by creating a UnixDir impl that, using JNI (C code), passes advanced
flags when opening files, based on the incoming IOContext.

The goal is a single UnixDir that has ifdefs so that it's usable
across multiple Unices, and eg would use direct IO if the context is
merging.  If we are ambitious we could rope Windows into the mix, too,
and then this would be NativeDir...
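
Roughly, the Java side of such a directory could look like the sketch below.
All names (IOKind, OpenStrategy, NativeOpener, UnixDirSketch) are
hypothetical, and the JNI call is replaced by an interface so the sketch
stays pure Java; the real flags would be chosen in the C code behind it.

```java
/** Hypothetical context kinds; stand-ins for whatever IOContext ends up carrying. */
enum IOKind { MERGE, FLUSH, READ }

/** How the native layer should open the file. */
enum OpenStrategy { DIRECT_IO, BUFFERED }

/** Stand-in for the JNI (C) entry point, which would call open(2) with platform flags. */
interface NativeOpener {
    void open(String fileName, OpenStrategy strategy);
}

/** Hypothetical directory that picks an open strategy from the incoming context. */
final class UnixDirSketch {
    private final NativeOpener opener;

    UnixDirSketch(NativeOpener opener) {
        this.opener = opener;
    }

    /** Only merges bypass the buffer cache; everything else stays buffered. */
    static OpenStrategy strategyFor(IOKind kind) {
        return kind == IOKind.MERGE ? OpenStrategy.DIRECT_IO : OpenStrategy.BUFFERED;
    }

    void openInput(String fileName, IOKind kind) {
        opener.open(fileName, strategyFor(kind));
    }
}
```

Keeping the decision in one small method means the per-OS ifdefs only have
to translate DIRECT_IO into whatever each platform supports.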

We can measure success by validating that a big merge while searching
does not hurt search performance?  (Ie we should be able to reproduce
the results from
http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html).
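
As a sketch of how that check could be scored: collect query latencies once
with no merge running and once during a big merge, then compare medians.
LatencyCompare and its method names are made up for illustration, and what
ratio counts as "does not hurt" is left open here.

```java
import java.util.Arrays;

/** Hypothetical helper for comparing search latency with and without a merge running. */
final class LatencyCompare {
    private LatencyCompare() {}

    /** Median of the given latencies (nanoseconds); the input array is not modified. */
    static double median(long[] nanos) {
        if (nanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Ratio of median latency during the merge to the baseline median; 1.0 means no slowdown. */
    static double slowdown(long[] baseline, long[] duringMerge) {
        return median(duringMerge) / median(baseline);
    }
}
```

A ratio near 1.0 with the native directory, and noticeably above 1.0 with
the default one, would be the kind of result we are looking to reproduce.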

> I have spoken to Michael McCandless and Simon Willnauer about
> undertaking these tasks. Michael McCandless has agreed to mentor me.
> I would love to be able to contribute and learn from Apache Lucene
> community this summer. Also I would love suggestions on how to make my
> application proposal stronger.

I think either Simon or I can be the "official" mentor, and then the
other one of us (and other Lucene committers) will support/chime
in...

This is an important change for Lucene!

Mike
