Posted to java-user@lucene.apache.org by Chris and Helen Bamford <ch...@bammers.net> on 2018/07/04 09:26:16 UTC

Size of Document

Hi there,

How can I calculate the total size of a Lucene Document that I'm about 
to write to an index so I know how many bytes I am writing please?  I 
need it for some external metrics collection.

Thanks

- Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Size of Document

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Size of Document
: To: java-user@lucene.apache.org
: References:
:     <CA...@mail.gmail.com>
:  <CA...@mail.gmail.com>
: Message-ID: <be...@bammers.net>
: In-Reply-To:
:     <CA...@mail.gmail.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.




-Hoss
http://www.lucidworks.com/



Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.
It was called IndexWriter.ramSizeInBytes() in 4.10.3.

On Wed, 4 Jul 2018 at 15:35, Chris Bamford <ch...@bammers.net> wrote:

> > IndexWriter.ramBytesUsed() gives you access to the current memory usage of
> > IndexWriter's buffers, but it can't tell you by how much it increased for a
> > given document assuming concurrent access to the IndexWriter.
> >
> Thanks, although I can’t find that API. Is there an equivalent call for
> Lucene 4.10.3?

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.
> IndexWriter.ramBytesUsed() gives you access to the current memory usage of
> IndexWriter's buffers, but it can't tell you by how much it increased for a
> given document assuming concurrent access to the IndexWriter.
> 
Thanks, although I can’t find that API. Is there an equivalent call for Lucene 4.10.3?





Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.
IndexWriter.ramBytesUsed() gives you access to the current memory usage of
IndexWriter's buffers, but it can't tell you by how much it increased for a
given document assuming concurrent access to the IndexWriter.
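
The before-and-after comparison only means something if no other thread
writes in between. A minimal sketch of that idea, assuming all writes can be
funnelled through one lock, is shown here. BufferDeltaProbe and IoAction are
hypothetical names, not Lucene classes, and the gauge is passed in as a plain
LongSupplier so the sketch does not depend on a particular Lucene version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of Lucene. */
final class BufferDeltaProbe {

    /** A write action that may throw IOException, as index writes do. */
    interface IoAction {
        void run() throws IOException;
    }

    private final Object lock = new Object();
    private final LongSupplier ramBytesUsed;

    /** @param ramBytesUsed gauge that reports the writer's current buffer size */
    BufferDeltaProbe(LongSupplier ramBytesUsed) {
        this.ramBytesUsed = ramBytesUsed;
    }

    /**
     * Runs one write while holding the lock and returns how much the gauge
     * changed. Writes that bypass this method make the result meaningless.
     */
    long measure(IoAction write) {
        synchronized (lock) {
            long before = ramBytesUsed.getAsLong();
            try {
                write.run();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return ramBytesUsed.getAsLong() - before;
        }
    }
}
```

With the API named in this thread, the gauge would be writer::ramBytesUsed
(or writer::ramSizeInBytes on 4.10.3) and the action would be
() -> writer.addDocument(doc), assuming Java 8 or later. The result is a
change in RAM buffer usage, not bytes on disk. It can be zero or negative if
the writer flushes during the call, and serializing writes this way costs
indexing throughput.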

On Wed, 4 Jul 2018 at 15:13, Chris Bamford <ch...@bammers.net> wrote:

> Hello Adrien,
>
> > There is no way to compute the byte size of a document.
>
> I feared that!
>
> > Also note that the relationship between the size of a document and how
> > much space it will use in the Lucene index is quite complex.
>
> I understand. I was wondering if there was maybe some sneaky way of
> peeking inside the IndexWriter before and after a write to compare buffer
> sizes?
>
> Thanks
> Chris

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.
Hello Adrien,


> 
> There is no way to compute the byte size of a document.

I feared that!

> Also note that the
> relationship between the size of a document and how much space it will use
> in the Lucene index is quite complex.
> 
I understand. I was wondering if there was maybe some sneaky way of peeking inside the IndexWriter before and after a write to compare buffer sizes?

Thanks 
Chris

> On Wed, 4 Jul 2018 at 11:26, Chris and Helen Bamford <ch...@bammers.net> wrote:
> 
>> Hi there,
>> 
>> How can I calculate the total size of a Lucene Document that I'm about
>> to write to an index so I know how many bytes I am writing please?  I
>> need it for some external metrics collection.
>> 
>> Thanks
>> 
>> - Chris
>> 




Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.
Hello,

There is no way to compute the byte size of a document. Also note that the
relationship between the size of a document and how much space it will use
in the Lucene index is quite complex.

On Wed, 4 Jul 2018 at 11:26, Chris and Helen Bamford <ch...@bammers.net> wrote:

> Hi there,
>
> How can I calculate the total size of a Lucene Document that I'm about
> to write to an index so I know how many bytes I am writing please?  I
> need it for some external metrics collection.
>
> Thanks
>
> - Chris
>

Re: Size of Document

Posted by Adrien Grand <jp...@gmail.com>.
For the record, this is made even more complex by the fact that the disk
footprint of a document depends on other documents that are indexed nearby
in the same segment, and can change over merges.
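
Since no per-document figure exists, one coarse fallback for metrics is to
sum the sizes of the files in the index directory after commits and track
how that total moves across many documents. The following is a minimal
sketch under that assumption. IndexDirSize is a hypothetical helper, not a
Lucene class, and it assumes the index lives in a single flat filesystem
directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Lucene. */
final class IndexDirSize {

    private IndexDirSize() {}

    /** Returns the combined size in bytes of the regular files directly inside indexDir. */
    static long totalBytes(Path indexDir) {
        try (Stream<Path> files = Files.list(indexDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .mapToLong(IndexDirSize::sizeOrZero)
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A file may vanish between listing and sizing (for example after a merge); count it as 0. */
    private static long sizeOrZero(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            return 0L;
        }
    }
}
```

The total describes the whole index at one moment. It can shrink after a
merge, so it is better suited to aggregate trends than to attributing bytes
to any single document.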

On Thu, 5 Jul 2018 at 08:22, Chris Bamford <ch...@bammers.net> wrote:

> Yes I see, I originally missed Terry’s response which is probably the
> source of the confusion.
>
> So to clarify: I already know the size of the source document. As you say,
> this bears little resemblance to what actually gets written when indexed.
> It is this latter figure I was hoping to get.
>
> Thanks everyone.
>
> Chris

Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.
Yes I see, I originally missed Terry’s response which is probably the source of the confusion.

So to clarify: I already know the size of the source document. As you say, this bears little resemblance to what actually gets written when indexed. It is this latter figure I was hoping to get.

Thanks everyone.

Chris



> On 5 Jul 2018, at 03:31, Erick Erickson <er...@gmail.com> wrote:
> 
> I think we're not talking about the same thing.
> 
> You asked "How can I calculate the total size of a Lucene Document"...
> 
> I was responding to Terry's comment "In the document types I
> usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> called "stream_size" that contains the size of the document on disk. "
> 
> Two totally different beasts. One is the source document, the other is
> what you choose to put into the index from that document. Not to even
> mention that you could, for instance, choose to index only the title
> and throw everything else away so the size of the raw document on disk
> doesn't seem useful for your case.
> 
> Best,
> Erick




Re: Size of Document

Posted by Erick Erickson <er...@gmail.com>.
I think we're not talking about the same thing.

You asked "How can I calculate the total size of a Lucene Document"...

I was responding to Terry's comment: "In the document types I
usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
called "stream_size" that contains the size of the document on disk."

Two totally different beasts. One is the source document, the other is
what you choose to put into the index from that document. Not to
mention that you could, for instance, choose to index only the title
and throw everything else away, so the size of the raw document on disk
doesn't seem useful for your case.

Best,
Erick

On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford <ch...@bammers.net> wrote:
> Hi Erick
>
> Yes, size on disk is what I’m after as it will feed into an eventual calculation regarding actual bytes written (not interested in the source data document size, just real disk usage).
> Thanks
>
> Chris
>
> Sent from my iPhone



Re: Size of Document

Posted by Chris Bamford <ch...@bammers.net>.
Hi Erick

Yes, size on disk is what I’m after as it will feed into an eventual calculation regarding actual bytes written (not interested in the source data document size, just real disk usage).
Thanks 

Chris

Sent from my iPhone

> On 4 Jul 2018, at 17:08, Erick Erickson <er...@gmail.com> wrote:
> 
> But does size on disk help? If the doc has a zillion
> images in it, those aren't part of the resulting index
> (I'm excluding stored data here)....




Re: Size of Document

Posted by Erick Erickson <er...@gmail.com>.
But does size on disk help? If the doc has a zillion
images in it, those aren't part of the resulting index
(I'm excluding stored data here)....

On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen <te...@net-frame.com> wrote:
> In the document types I usually index (.pdf, .docx/.doc, .eml), there
> exists a metadata field called "stream_size" that contains the size of
> the document on disk.  You don't have to compute it.  Thus, when you
> retrieve each document you can pull out the contents of this field and,
> if you like, include it in each hitlist entry.



Re: Size of Document

Posted by Terry Steichen <te...@net-frame.com>.
In the document types I usually index (.pdf, .docx/.doc, .eml), there
exists a metadata field called "stream_size" that contains the size of
the document on disk.  You don't have to compute it.  Thus, when you
retrieve each document you can pull out the contents of this field and,
if you like, include it in each hitlist entry.


On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
> Hi there,
>
> How can I calculate the total size of a Lucene Document that I'm about
> to write to an index so I know how many bytes I am writing please?  I
> need it for some external metrics collection.
>
> Thanks
>
> - Chris
>

