Posted to java-user@lucene.apache.org by Ramprakash Ramamoorthy <yo...@gmail.com> on 2012/12/07 08:32:37 UTC

Separating the document dataset and the index dataset

Greetings,

         We are using Lucene in our log analysis tool. We ingest around
35 GB of data a day, and our practice is to zip week-old indices and unzip
them when the need arises.

           Though the compression offers a huge saving in disk space, the
decompression becomes an overhead. At times it takes around 10 minutes
(decompression takes 95% of the time) to search across a month-long set of
logs. We need to unzip fully at least to get the total count from the index.
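[Editor's note: the overhead described here is inherent to whole-archive compression: a gzip/zip stream has no random access, so even a simple count forces a full sequential decompression pass. A minimal illustration with plain java.util.zip (not Lucene-specific; the log lines are fabricated for the demo):]

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCountDemo {
    public static void main(String[] args) throws IOException {
        // Write 100,000 fake log lines into an in-memory gzip stream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new GZIPOutputStream(buf))) {
            for (int i = 0; i < 100_000; i++) {
                out.println("2012-12-07 08:32:37 INFO event " + i);
            }
        }
        // Even just counting the lines requires decompressing every byte:
        // GZIPInputStream cannot seek into the middle of the stream.
        int count = 0;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(
                        new ByteArrayInputStream(buf.toByteArray()))))) {
            while (in.readLine() != null) count++;
        }
        System.out.println(count); // prints 100000
    }
}
```

This is exactly why per-field compression inside the index (discussed below) beats zipping the whole index directory: the posting lists stay directly readable, and only the stored documents pay a (chunked) decompression cost on retrieval.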

           My question is: we are setting Index.Store to true. Is there a
way we can split the index dataset and the document dataset? In my
understanding, if such a separation is possible, the document dataset alone
could be zipped, leaving the index dataset on disk. Would it be feasible to
do this? Any pointers?

           Or is adding more disks the only solution? Thanks in advance!

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
+91 9626975420

RE: Separating the document dataset and the index dataset

Posted by Jain Rahul <ja...@ivycomptech.com>.
Hi Ram,

You need the lucene-codecs jar on the classpath; it contains CompressingCodec and the related classes.

If you are building directly on top of Lucene, you can set it by calling setCodec(Codec codec) on IndexWriterConfig.

But if you are using Solr, I couldn't figure out a clean way to do it, so I resorted to the small hack in Codec.java shown below. Perhaps someone from the community can suggest a neater solution.

In org.apache.lucene.codecs.Codec the default codec is hardcoded to Lucene40. I changed it so the codec can be chosen via a system property, e.g. -Dlucene.codec=Compressing:

  // private static Codec defaultCodec = Codec.forName("Lucene40");
  // Read the codec name from a system property, defaulting to Lucene40:
  private static Codec defaultCodec = Codec.forName(System.getProperty("lucene.codec", "Lucene40"));
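[Editor's note: the non-Solr route Rahul mentions can be sketched as below. This is a sketch under assumptions, not a verified recipe: it assumes Lucene 4.0 with the lucene-codecs module on the classpath, and the SPI codec name "Compressing" is taken from the -Dlucene.codec=Compressing example in this thread.]

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class CompressingWriterSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        // Look up the SPI-registered compressing codec by name and use it
        // for all segments this writer creates.
        cfg.setCodec(Codec.forName("Compressing"));
        IndexWriter writer = new IndexWriter(dir, cfg);
        // ... addDocument() calls as usual; stored fields get compressed ...
        writer.close();
        dir.close();
    }
}
```

Note this only affects newly written segments; existing segments keep whatever codec they were written with until they are merged or reindexed.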

Regards,
Rahul



Re: Separating the document dataset and the index dataset

Posted by Ramprakash Ramamoorthy <yo...@gmail.com>.
On Tue, Dec 11, 2012 at 4:10 PM, Uwe Schindler <us...@pangaea.de> wrote:

> In Lucene 4.1 the compressing codec is no longer a separate codec, the
> main Codec ("Lucene41") compresses by default. Just reindex your data or
> use IndexUpgrader.
>

Thanks Uwe. This one helped. My index size came down from 816 MB to 198 MB.
Win!



-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420

RE: Separating the document dataset and the index dataset

Posted by Uwe Schindler <us...@pangaea.de>.
In Lucene 4.1 the compressing codec is no longer a separate codec, the main Codec ("Lucene41") compresses by default. Just reindex your data or use IndexUpgrader.

Uwe

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Data Publisher for Earth & Environmental Science
MARUM (Cognium building) - University of Bremen
Room 0510, Hochschulring 18, D-28359 Bremen
Tel.: +49 421 218 65595
Fax:  +49 421 218 65505
http://www.pangaea.de/
E-mail: uschindler@pangaea.de
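[Editor's note: Uwe's "just reindex or use IndexUpgrader" advice can be sketched as below, assuming Lucene 4.1 on the classpath. IndexUpgrader rewrites every segment with the current default codec, whose stored fields format compresses out of the box.]

```java
import java.io.File;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));
        // Rewrites all segments in the current (Lucene 4.1) format;
        // equivalent to running IndexUpgrader from the command line.
        new IndexUpgrader(dir, Version.LUCENE_41).upgrade();
        dir.close();
    }
}
```

The same tool can also be invoked directly from the command line with the index directory as its argument.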




Re: Separating the document dataset and the index dataset

Posted by Ramprakash Ramamoorthy <yo...@gmail.com>.
On Tue, Dec 11, 2012 at 3:14 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> You can use Lucene 4.1 nightly builds from http://goo.gl/jZ6YD - it is
> not yet released, but upgrading from Lucene 4.0 is easy. If you are not yet
> on Lucene 4.0, there is more work to do, in that case a solution to your
> problem would be to save the stored fields in a separate database/whatever
> and only add *one* stored field to your index, containing the document ID
> inside this external database.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de


Thank you Uwe. I already tried the nightly build, but the codecs jar in it
doesn't contain a compressing codec at all. I also tried pulling from trunk
and compiling it myself; same issue:
*org.apache.lucene.codecs.compressing* is missing. Any pointers?



-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420

RE: Separating the document dataset and the index dataset

Posted by Uwe Schindler <uw...@thetaphi.de>.
You can use the Lucene 4.1 nightly builds from http://goo.gl/jZ6YD - it is not yet released, but upgrading from Lucene 4.0 is easy. If you are not yet on Lucene 4.0, there is more work to do; in that case, a solution to your problem would be to save the stored fields in a separate database (or similar) and add only *one* stored field to your index, containing the document ID in this external database.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
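[Editor's note: a rough sketch of the pre-4.0 fallback Uwe describes, using the Lucene 3.x Field API. saveToExternalStore is a hypothetical helper standing in for whatever database or key-value store is used; it is not a Lucene API.]

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ExternalStoreSketch {
    // Hypothetical: writes the raw line to an external store and returns
    // its primary key. Backend-specific, not part of Lucene.
    static String saveToExternalStore(String rawLogLine) {
        throw new UnsupportedOperationException("backend-specific");
    }

    static void indexLine(IndexWriter writer, String rawLogLine) throws Exception {
        String externalId = saveToExternalStore(rawLogLine);
        Document doc = new Document();
        // The only stored field: the key to fetch the original line back.
        doc.add(new Field("id", externalId,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Searchable, but not stored in the Lucene index itself.
        doc.add(new Field("message", rawLogLine,
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}
```

With this layout the Lucene index stays small and uncompressed on disk (fast counts and searches), while the bulky raw documents can live in a store that compresses them however is convenient.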




Re: Separating the document dataset and the index dataset

Posted by Ramprakash Ramamoorthy <yo...@gmail.com>.
On Fri, Dec 7, 2012 at 1:11 PM, Jain Rahul <ja...@ivycomptech.com> wrote:

> If you are using Lucene 4.0 and can afford to compress your document
> dataset while indexing, it will be a huge saving in terms of disk space
> and also in IO (resulting in better indexing throughput).
>
> In our case it has helped us a lot: the compressed data was roughly a
> third of the original document dataset size.
>
> You may want to check  the below  link.
>
>
> http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene
>
> Regards,
> Rahul
>

Thank you Rahul. That indeed seems promising. Just one doubt: how do I plug
this CompressingStoredFieldsFormat into my app? I tried bundling it in a
codec, but I am not sure I am on the right path. Any pointers would be of
great help!


-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420

RE: Separating the document dataset and the index dataset

Posted by Jain Rahul <ja...@ivycomptech.com>.
If you are using Lucene 4.0 and can afford to compress your document dataset while indexing, it will be a huge saving in terms of disk space and also in IO (resulting in better indexing throughput).

In our case it has helped us a lot: the compressed data was roughly a third of the original document dataset size.

You may want to check the link below.

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Regards,
Rahul


This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised. Any actions taken on the basis of this email are at the recipient's own risk.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org