Posted to general@lucene.apache.org by pof <Me...@gmail.com> on 2009/06/25 02:47:39 UTC

Index Ratio

Hi, I just completed a batch test index of ~1100 documents of various file
types and I noticed that the original documents take up about 145MB but my
index is only 1.7MB. I remember reading somewhere that the typical
index-to-source ratio is about 20-30% or something, but mine is a little over
1%! I'm not complaining or anything; it just struck me as odd, especially as I
have a lot of archive files and emails with attachments that I parse as well.
Has anyone else experienced something like this? I'm just curious.

Cheers. Brett.
-- 
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
Sent from the Lucene - General mailing list archive at Nabble.com.
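
For readers following along, the ratio being asked about can be checked with a
little arithmetic. The following is only a minimal sketch using the two sizes
quoted in the post; the function name index_ratio_percent is made up for
illustration and is not part of Lucene.

```python
def index_ratio_percent(index_size_mb: float, source_size_mb: float) -> float:
    """Return the index size as a percentage of the source document size."""
    if source_size_mb <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_size_mb / source_size_mb


# Sizes as quoted in the post: a 1.7MB index built from ~145MB of documents.
ratio = index_ratio_percent(1.7, 145.0)
print(f"index is {ratio:.2f}% of the source size")
```

The same function can be pointed at any index directory and source directory
once their sizes have been measured.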

Re: Index Ratio

Posted by Ted Dunning <te...@gmail.com>.

I was actually suggesting that you build synthetic documents so that you
know *exactly* that these documents exist and have known values for every
field.  Your test is good, but not comprehensive since it doesn't test every
field and one of the best ways to get a small index is to only index a few
fields with small values.

On Wed, Jun 24, 2009 at 7:39 PM, pof <Me...@gmail.com> wrote:

>> (it is also very helpful to have some test documents with extraordinary
>> values in key fields so that you can verify indexing and retrieval.  These
>> are called tracer bullets in some quarters and it is handy to have at least
>> one such tracer per input file.  You can also add corpus meta-data this way
>> (n documents for file f).  If you put a special field on these documents you
>> can include or exclude them from your retrievals with essentially no cost)
>
> I have done this to a small extent (searching for a few unique terms, such
> as a one-off email address), but I will give it more of a go.

Re: Index Ratio

Posted by pof <Me...@gmail.com>.




> Do retrievals work?  Are you sure that you are indexing all of the fields
> of
> interest?
> 
Seems so. I have only done a handful of tests, but so far so good.


> Is maxDoc() plausible?
> 
Yup.


> Do the term vectors for each field look right?
> 
I wouldn't know how to go about that.

> (it is also very helpful to have some test documents with extraordinary
> values in key fields so that you can verify indexing and retrieval.  These
> are called tracer bullets in some quarters and it is handy to have at least
> one such tracer per input file.  You can also add corpus meta-data this way
> (n documents for file f).  If you put a special field on these documents you
> can include or exclude them from your retrievals with essentially no cost)

I have done this to a small extent (searching for a few unique terms, such as
a one-off email address), but I will give it more of a go.

Cheers. Brett.

Re: Index Ratio

Posted by Ted Dunning <te...@gmail.com>.

That sounds a bit more than plausibly good.

Do retrievals work?  Are you sure that you are indexing all of the fields of
interest?

Is maxDoc() plausible?

Do the term vectors for each field look right?

(it is also very helpful to have some test documents with extraordinary
values in key fields so that you can verify indexing and retrieval.  These
are called tracer bullets in some quarters and it is handy to have at least
one such tracer per input file.  You can also add corpus meta-data this way
(n documents for file f).  If you put a special field on these documents you
can include or exclude them from your retrievals with essentially no cost)
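
The tracer idea above can be illustrated without any search library at all.
What follows is a rough sketch, not Lucene code: ToyIndex is a hypothetical
in-memory inverted index, and the field names (body, source, doc_count,
is_tracer) and the tracer term are assumptions chosen for the example.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory inverted index; a stand-in for a real search index."""

    def __init__(self):
        self.docs = []                     # stored field dicts, by doc id
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids

    def add(self, fields):
        doc_id = len(self.docs)
        self.docs.append(fields)
        for field, value in fields.items():
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term, include_tracers=True):
        hits = set(self.postings.get((field, term.lower()), set()))
        if not include_tracers:
            # The special marker field makes tracers cheap to filter out.
            hits -= self.postings.get(("is_tracer", "yes"), set())
        return sorted(hits)


index = ToyIndex()
real_id = index.add({"body": "quarterly report for the sales team",
                     "source": "a.doc"})

# Tracer: an extraordinary value that no real document would contain,
# plus corpus meta-data (how many documents came from this input file).
tracer_id = index.add({
    "body": "zzqtracer0001 quarterly",
    "source": "a.doc",
    "doc_count": "1",
    "is_tracer": "yes",
})

# If the tracer term cannot be found, indexing or retrieval is broken.
assert index.search("body", "zzqtracer0001") == [tracer_id]
print(index.search("body", "quarterly"))                         # with tracers
print(index.search("body", "quarterly", include_tracers=False))  # without
```

With a real index the same check would be a query for the tracer term plus a
filter on the marker field; the exact calls depend on the library version in
use.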

-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)

Re: Index Ratio

Posted by pof <Me...@gmail.com>.

Most of these files are of type .doc, .pdf and .msg. There are some .eml,
.txt, .htm, .docx and so on as well, to a lesser extent. I did consider the
fact that the plain text makes up only a small percentage of each of these
proprietary file types, but the ratio still seemed small.


Chris Collins wrote:
> 
> You mention documents of various file types.  It really depends on  
> what those types are.  For example the amount of text found in a  
> powerpoint file is slim pickins.  Ratios with office type apps tend to  
> be pretty fluffy.  I have seen considerably better than 20-30% when  
> extracting text from such formats, some down to the ratio you're talking
> of.
> 
> C


Re: Index Ratio

Posted by Chris Collins <ch...@yahoo.com>.

You mention documents of various file types.  It really depends on  
what those types are.  For example the amount of text found in a  
powerpoint file is slim pickins.  Ratios with office type apps tend to  
be pretty fluffy.  I have seen considerably better than 20-30% when  
extracting text from such formats, some down to the ratio you're talking
of.

C

Re: Index Ratio

Posted by pof <Me...@gmail.com>.

Checked out the index with Luke. Yep, all the text has been indexed 100%
correctly. I have to say WOW, Luke is a great little tool; I am majorly
impressed. Thanks guys for all your suggestions and insight.


pof wrote:
> 
> Three randomly selected documents
> 
> .doc = 125KB Plain text = 761 bytes (0.59%)
> .pdf = 372KB Plain text = 12.9KB (3.49%)
> .eml = 171KB Plain text = 2KB (1.15%)
> 
> Even though this is a small sample, it shows my index ratio of 1-2% to be
> plausible. I'm checking out the Luke index toolbox now.


Re: Index Ratio

Posted by pof <Me...@gmail.com>.

Three randomly selected documents

.doc = 125KB Plain text = 761 bytes (0.59%)
.pdf = 372KB Plain text = 12.9KB (3.49%)
.eml = 171KB Plain text = 2KB (1.15%)

Even though this is a small sample, it shows my index ratio of 1-2% to be
plausible. I'm checking out the Luke index toolbox now.
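
The per-file percentages above can be recomputed, and combined into one figure
for the sample, with a short script. This is only a sketch: it assumes "KB"
means 1024 bytes and uses the rounded sizes exactly as posted, so the results
may differ slightly from the percentages quoted, which were presumably worked
out from exact byte counts.

```python
KB = 1024  # assumption: "KB" in the post means 1024 bytes

# (file type, original size in bytes, extracted plain-text size in bytes),
# taken from the rounded figures in the message above.
samples = [
    (".doc", 125 * KB, 761),
    (".pdf", 372 * KB, 12.9 * KB),
    (".eml", 171 * KB, 2 * KB),
]


def text_fraction_percent(original_bytes, text_bytes):
    """Return the extracted text size as a percentage of the original file."""
    return 100.0 * text_bytes / original_bytes


for ext, original, text in samples:
    print(f"{ext}: {text_fraction_percent(original, text):.2f}% plain text")

# Size-weighted figure for the whole sample.
total_original = sum(original for _, original, _ in samples)
total_text = sum(text for _, _, text in samples)
overall = text_fraction_percent(total_original, total_text)
print(f"overall: {overall:.2f}% plain text")
```

Weighting by file size, rather than averaging the three percentages, is the
closer analogue of comparing a whole index against a whole document set.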

Chris Collins wrote:
> 
> There are other factors too, such as how broad the vocabulary of the
> content is and which analyzers you use.  Have you tried running your
> filters to generate just plain text files and comparing the size of the
> text with the original?
> 
> C


Re: Index Ratio

Posted by Chris Collins <ch...@yahoo.com>.

There are other factors too, such as how broad the vocabulary of the
content is and which analyzers you use.  Have you tried running your
filters to generate just plain text files and comparing the size of the
text with the original?

C
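
The vocabulary point can be made concrete with a rough sketch. An inverted
index records each distinct term once in its term dictionary, so text that
repeats a few words needs far fewer dictionary entries than text of the same
length with many different words. The tokenizer below is a crude stand-in for
a real analyzer, and both sample texts are invented for the example.

```python
import re


def vocabulary_stats(text):
    """Return (total tokens, distinct terms) using a crude lowercase tokenizer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return len(tokens), len(set(tokens))


# Two made-up texts with the same token count but very different vocabularies.
narrow = "status report status report status update " * 50
broad = " ".join(f"term{i}" for i in range(300))

for name, text in (("narrow", narrow), ("broad", broad)):
    total, distinct = vocabulary_stats(text)
    print(f"{name}: {total} tokens, {distinct} distinct terms, "
          f"{len(text)} characters of text")
```

Running the same count over the plain text produced by the extraction filters
gives a quick feel for how repetitive the real corpus is.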


On Jun 24, 2009, at 9:28 PM, pof wrote:

>
> It would seem that .doc files have about 30KB overhead (not including
> pictures, graphs, meta data etc) on top of the plain text and about  
> 3KB for
> .pdfs.

Re: Index Ratio

Posted by pof <Me...@gmail.com>.

It would seem that .doc files have about 30KB overhead (not including
pictures, graphs, meta data etc) on top of the plain text and about 3KB for
.pdfs.
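
To see what those overheads imply, here is a rough sketch that treats them as
a fixed cost per file. That is a simplifying assumption; pictures, graphs and
metadata are ignored, so real files will usually show an even lower share of
plain text. The names OVERHEAD_BYTES and expected_text_percent are invented
for the example.

```python
KB = 1024  # assumption: "KB" means 1024 bytes

# Approximate per-file container overhead reported above, treated as fixed.
OVERHEAD_BYTES = {".doc": 30 * KB, ".pdf": 3 * KB}


def expected_text_percent(ext, text_bytes):
    """Share of the file that is plain text, given a fixed format overhead."""
    total = text_bytes + OVERHEAD_BYTES[ext]
    return 100.0 * text_bytes / total


for ext in OVERHEAD_BYTES:
    for text_bytes in (1, 1 * KB, 10 * KB):
        share = expected_text_percent(ext, text_bytes)
        print(f"{ext} with {text_bytes} bytes of text: {share:.2f}% plain text")
```

Under this model a short document is dominated by the container format, which
is consistent with the low ratios reported in this thread.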

Otis Gospodnetic wrote:
> 
> 
> Hi Brett,
> 
> Try creating a simple MS Word document with just a single character in it. 
> Save it as .doc and check the size.  Export to PDF and check the size.  I
> don't know exactly how big those docs will be, but I bet they'll be many,
> many times larger than that one byte character.  Open up your index with
> Luke to see what's in it.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Index Ratio

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Brett,

Try creating a simple MS Word document with just a single character in it.  Save it as .doc and check the size.  Export to PDF and check the size.  I don't know exactly how big those docs will be, but I bet they'll be many, many times larger than that one byte character.  Open up your index with Luke to see what's in it.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


