
Posted to user@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2006/02/15 18:23:42 UTC

Single NutchBean and multiple indices support

Hi there.

I am facing the same question and looking for the same solution.
Your solution seems easy :) My question is: what file system does the
application run on?
LocalFileSystem or DistributedFileSystem?

Thanks
/Jack

On 2/9/06, Ravi Chintakunta <ra...@gmail.com> wrote:
> Hi David,
>
> Thanks for your reply.
>
> After posting the question, I have done this in a more optimal way.
>
> - I used only a single NutchBean and modified it so that the search
> method takes the indices being searched as an argument. This single
> NutchBean creates separate IndexReaders on the merged indices in the
> directories and keeps them in a map.
>
> - Based on the indexes that are searched, NutchBean creates an
> IndexSearcher using the appropriate IndexReaders. I have added a
> constructor to IndexSearcher that takes an array of IndexReaders and
> uses a MultiReader to initialize itself.
>
> - The NutchBean creates a single FetchedSegments with the combination
> of the segments directories in all the directories.
>
> The advantages with this are:
>
> - A single IndexReader for an index - so no additional filehandles are created.
> - No opening / closing of readers or segments - this improves performance.
>
>
> - Ravi Chintakunta
>
>
> > This is almost exactly what I've done.  I create a new NutchBean for
> > each search, and point it at whichever of 9 subdirectories the user has
> > selected; because I really don't want 511 (2^9-1) beans hanging around.
> >
> > The reason for the "too many open files" is that the NutchBean doesn't
> > clean up after itself - I guess because for most people, the NutchBean
> > is going to be reused.
> >
> > I added a close() method to FetchedSegments.Segment in my installation,
> > to close all the readers.  I added a closeSegments() method to
> > NutchBean, to call close() on each segment that's been opened.  Then I
> > call closeSegments() after each search.
> >
> > I realise that NutchBean really wasn't designed to support being
> > instantiated once per search, but I don't care.  It works well, and
> > performance is not an issue.
> >
> > Regards,
> > David.
> >
> >
> > Date: Mon, 6 Feb 2006 20:59:34 -0500
> > From: Ravi Chintakunta <ra...@gmail.com>
> > To: nutch-user@lucene.apache.org
> > Subject: [Nutch-general] Dynamic merging of indices
> > Reply-To: user@nutch.org
> >
> > I have multiple indices for the crawls across various intranet sites
> > stored in separate folders. My search application should support
> > searching across one or more of these indices dynamically - by way of
> > checkboxes on the web page.  For this, I have modified NutchBean to
> > create the IndexSearcher and FetchedSegments from the segments
> > directory (not the merged index directory) in these folders.  Based on
> > the selected intranet sites, a NutchBean is instantiated for the
> > indices  of the selected sites and the results are displayed.
> >
> > With this I got the "Too many open files" error and have increased the
> > open-file limit.
> >
> > This seems to work well now. But if I have 5 such sites, then I am
> > opening 2^5 = 32 times more files than I would have opened.
> >
> > My question is: Is there a better way of doing this? Like:
> >
> > - Can I open an IndexReader on each of the merged index directories and
> > dynamically create an IndexSearcher by merging these readers using
> > MultiReader?
> >
> > - Is an IndexReader thread safe and can it be used simultaneously in
> > different IndexSearchers?
> >
> > - Can I create the IndexReader on the merged index directory and
> > create the corresponding FetchedSegments on the corresponding
> > non-merged segments directory?
> >
> > Thanks
> > Ravi Chintakunta
> >
> >
> >
> >
> > ********************************************************************************
> > This email may contain legally privileged information and is intended only for the addressee. It is not necessarily the official view or
> > communication of the New Zealand Qualifications Authority. If you are not the intended recipient you must not use, disclose, copy or distribute this email or
> > information in it. If you have received this email in error, please contact the sender immediately. NZQA does not accept any liability for changes made to this email or attachments after sending by NZQA.
> >
> > All emails have been scanned for viruses and content by MailMarshal.
> > NZQA reserves the right to monitor all email communications through its network.
> >
> > ********************************************************************************
> >
> >
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
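
The single-bean design Ravi describes in the quoted message above (one
reader per index kept in a map, combined per search) might look roughly
like the sketch below. This is not Ravi's code and it does not use the
Lucene or Nutch APIs: StubReader, ReaderRegistry, readerFor, readersFor
and openedCount are hypothetical names invented for illustration. In a
real installation the readers returned for a search would presumably be
handed to a MultiReader, as Ravi describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an opened index reader; NOT Lucene's IndexReader.
final class StubReader {
    final String indexDir;

    StubReader(String indexDir) {
        this.indexDir = indexDir;
    }
}

// Hypothetical registry: opens one reader per index directory and reuses it.
final class ReaderRegistry {
    private final Map<String, StubReader> readers = new LinkedHashMap<>();
    private int opened = 0;

    // Returns the shared reader for a directory, opening it on first use only.
    synchronized StubReader readerFor(String indexDir) {
        StubReader reader = readers.get(indexDir);
        if (reader == null) {
            reader = new StubReader(indexDir);
            readers.put(indexDir, reader);
            opened++;
        }
        return reader;
    }

    // Collects the shared readers for the indices selected for one search.
    // Assumption: real code would combine these (e.g. via a MultiReader).
    synchronized List<StubReader> readersFor(List<String> selectedDirs) {
        List<StubReader> selected = new ArrayList<>();
        for (String dir : selectedDirs) {
            selected.add(readerFor(dir));
        }
        return selected;
    }

    // Number of readers opened so far, regardless of how many searches ran.
    synchronized int openedCount() {
        return opened;
    }
}

final class ReaderRegistryDemo {
    public static void main(String[] args) {
        ReaderRegistry registry = new ReaderRegistry();
        registry.readersFor(Arrays.asList("siteA", "siteB"));
        registry.readersFor(Arrays.asList("siteB", "siteC"));
        // Two searches over overlapping selections open three readers in total.
        System.out.println(registry.openedCount()); // prints 3
    }
}
```

The point of the map is that the number of open readers grows with the
number of indices, not with the number of index combinations a user can
select.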

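For comparison, David's bean-per-search approach from the same quoted
thread (open the selected segments, search, then close everything) could
be sketched as follows. Again this is only an illustration under stated
assumptions: StubSegment, PerSearchBean, PerSearchDemo and the handle
counter are hypothetical stand-ins, not Nutch classes. Only the idea of a
closeSegments() call after each search comes from David's message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one opened segment; NOT Nutch's FetchedSegments.Segment.
final class StubSegment implements AutoCloseable {
    private final String dir;
    private final AtomicInteger openHandles;
    private boolean closed = false;

    StubSegment(String dir, AtomicInteger openHandles) {
        this.dir = dir;
        this.openHandles = openHandles;
        openHandles.incrementAndGet(); // opening a segment takes a handle
    }

    String dir() {
        return dir;
    }

    @Override
    public void close() {
        if (!closed) { // closing twice releases the handle only once
            closed = true;
            openHandles.decrementAndGet();
        }
    }
}

// Hypothetical bean that is created for a single search and then discarded.
final class PerSearchBean implements AutoCloseable {
    private final List<StubSegment> segments = new ArrayList<>();

    PerSearchBean(List<String> selectedDirs, AtomicInteger openHandles) {
        for (String dir : selectedDirs) {
            segments.add(new StubSegment(dir, openHandles));
        }
    }

    // Placeholder search: just reports how many segments would be consulted.
    int search(String query) {
        return segments.size();
    }

    // Mirrors the closeSegments() idea: close every segment this bean opened.
    void closeSegments() {
        for (StubSegment segment : segments) {
            segment.close();
        }
    }

    @Override
    public void close() {
        closeSegments();
    }
}

final class PerSearchDemo {
    // try-with-resources guarantees the segments are closed after the search.
    static int searchOnce(List<String> dirs, String query, AtomicInteger openHandles) {
        try (PerSearchBean bean = new PerSearchBean(dirs, openHandles)) {
            return bean.search(query);
        }
    }
}
```

The trade-off between the two sketches is the one discussed in the
thread: closing after every search keeps the handle count at zero between
requests but reopens everything each time, while the shared-reader map
keeps a fixed set of handles open and avoids the reopen cost.
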
Re: Single NutchBean and multiple indices support

Posted by Ravi Chintakunta <ra...@gmail.com>.

Hi Jack,

It runs on a local file system.

- Ravi Chintakunta

On 2/15/06, Jack Tang <hi...@gmail.com> wrote:
> Hi there.
>
> I am facing the same question and looking for the same solution.
> Your solution seems easy :) My question is: what file system does the
> application run on?
> LocalFileSystem or DistributedFileSystem?
>
> Thanks
> /Jack