Posted to dev@trafficserver.apache.org by Bogdan Graur <bg...@gmail.com> on 2012/09/05 10:54:03 UTC

Re: Partial object caching

>
> > 1. Is there a one-to-many relation between a cached object and a Doc
> > structure?
>
> Yes, that's the chained Docs in the lower right. Each Doc represents a
> fragment and we have discussed previously how objects are stored as an
> ordered set of fragments.
>

Just for clarification: I did some tests and it seems the entries in the
"alternate vector" stored in the first "Doc" are different versions of the
same cached object? (I see that these entries are scanned for the best
quality match, and the most recent entry with the highest score is used.)
So actually "AltVec" is a vector of linked lists, each linked list being a
different version of a cached object?

As the cached object is represented as a linked list of fragments, does
that mean that in order to serve a specific range from it we have to read
all fragments (presumably from disk), following next_key, until we get to
the significant fragment? Or maybe I'm missing something again ..


> > 2. A Doc structure contains a table of fragments which are represented as
> > uint64_t offsets.
>
> The fragment offsets are the location in the object of the fragment data.
> frag_offset[i] is the address of the first byte past the end of fragment i.
> To simplify, presume fragments are at most 100 bytes long and we have an
> object that is 390 bytes long in four fragments (0,1,2,3). Then
> frag_offset[1] could be 201 which would mean the next byte past the end of
> fragment 1 is byte 201 of 390 in the original object (or equivalently that
> the first byte of fragment 2 is byte 201 of 390 in the object). This data
> is used to accelerate range requests so that if you had a request for bytes
> 220-300 of the object you could skip immediately to fragment 2 without
> reading fragment 1. In real life there are some very large (20M+) files out
> there and being able to skip reading the first 10M or so from disk is an
> enormous performance improvement. Note the earliest doc for an alternate is
> always read because of the way the logic is done now, although in theory
> that could be skipped because the fragment data you need is in the
> First Doc, which doesn't contain any object data for objects with multiple
> alternates or that are more than one fragment long.
>
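As a rough sketch of the skip logic described above (the function name and table layout here are illustrative assumptions, not the actual ATS code), a lookup only has to walk the offset table until it finds the first fragment whose end offset lies past the start of the requested range:

#include <cstdint>
#include <vector>

// Illustrative only: frag_offset[i] is the offset of the first byte past
// the end of fragment i; the last fragment has no entry of its own.
int find_first_fragment(const std::vector<uint64_t> &frag_offset, uint64_t range_start)
{
  for (size_t i = 0; i < frag_offset.size(); ++i) {
    if (range_start < frag_offset[i]) {
      return static_cast<int>(i); // this fragment contains the starting byte
    }
  }
  return static_cast<int>(frag_offset.size()); // start byte is in the last fragment
}

With the 390-byte example above, say frag_offset = {101, 201, 301}, a request for bytes 220-300 gives find_first_fragment(frag_offset, 220) == 2, so fragments 0 and 1 are never read from disk.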
From the description of the fragment offsets, it seems to me that these
"fragments inside a fragment" resemble the chunks you describe in your
original e-mail.
The differences are:
- currently these chunks may be of different sizes, as opposed to the
proposed new model of equally sized chunks
- there is currently no bitfield to say which chunks are valid
Are these "fragments inside a fragment" currently used for anything other
than range request acceleration?

Best regards,
Bogdan Graur

Re: Partial object caching

Posted by Bogdan Graur <bg...@gmail.com>.
No feedback from you guys for a week now.

Which solution do you consider better? The chunk validity bitmap, the
start/end span scheme or maybe another idea?


On Wed, Sep 12, 2012 at 1:30 AM, Bogdan Graur <bg...@gmail.com> wrote:

>
>> > Can we also consider the other way: make the request to the origin server
>> > with a larger range so that we may 'join' two disjoint parts of the
>> > data?
>> > (try to avoid having many empty chunks in between filled chunks)
>>
>> We can consider it but I think it won't work. It would require, among
>> other things, cleverness in changing the outgoing request *and* tweaking
>> the response so that the client gets the originally requested range.
>> Unlikely to be worth it - better to do something in the nature of the other
>> bug you mentioned where, in a plugin, completely synthetic range requests
>> are generated to fill gaps.
>>
> You're right. Changing the request and the response is not the safest
> thing to do.
> The idea to generate synthetic range requests from a plugin to fill in
> gaps sounds very good if you say it can be done.
>
>
>> > The solution with the chunk validity bitmap has the advantage that it
>> > works
>> > fine without throwing away cached data.
>> > Does throwing away parts of a cached object while updating the cache for
>> > the same object raise any synchronization issues?
>>
>> No, because you don't really throw away the data, you simply change the
>> valid span values in the fragment. Reading and writing those have to be
>> synchronized for other reasons so it's no additional cost.
>>
>> Which is better may well depend on our anticipated access pattern. It's
>> easy to think of a plausible pattern where a client "rolls through" an
>> object via range requests (a very common pattern in practice) and each
>> range request spans chunks without ever covering a chunk so you don't
>> actually store anything from the entire set of requests, even though ATS
>> saw the entire object. The span scheme, OTOH, would end up capturing the
>> entire object.
>>
>
> This "roll through" use case might indeed be the common pattern.
>
> The chunk validity solution behavior for this case:
> In the worst case, as you commented, after reading the whole object we
> will not be able to cache anything from the object.
> In the average case (when a range request is bigger than the size of the
> chunk) the cached object will have many unfilled chunks and we might have
> to do a lot of small range requests to fill these gaps.
>
> The span scheme behavior is ideal in both the average and worst case
> scenario.
>
> Other use cases:
> - "roll through" a part of the object then skip a portion of a file and
> then "roll through" again (and so on): the span scheme is again the better
> one as this is a generalization of the common pattern.
> - make many "disjoint" range requests: the two solutions would probably be
> equally good with a plus on the side of the chunk validity bitmap solution
> as it might cache more data.
>
> Any other use cases I have missed?
>
>
>
>
>

Re: Partial object caching

Posted by Bogdan Graur <bg...@gmail.com>.
> > Can we also consider the other way: make the request to the origin server
> > with a larger range so that we may 'join' two disjoint parts of the data?
> > (try to avoid having many empty chunks in between filled chunks)
>
> We can consider it but I think it won't work. It would require, among
> other things, cleverness in changing the outgoing request *and* tweaking
> the response so that the client gets the originally requested range.
> Unlikely to be worth it - better to do something in the nature of the other
> bug you mentioned where, in a plugin, completely synthetic range requests
> are generated to fill gaps.
>
You're right. Changing the request and the response is not the safest thing
to do.
The idea to generate synthetic range requests from a plugin to fill in gaps
sounds very good if you say it can be done.


> > The solution with the chunk validity bitmap has the advantage that it
> > works
> > fine without throwing away cached data.
> > Does throwing away parts of a cached object while updating the cache for
> > the same object raise any synchronization issues?
>
> No, because you don't really throw away the data, you simply change the
> valid span values in the fragment. Reading and writing those have to be
> synchronized for other reasons so it's no additional cost.
>
> Which is better may well depend on our anticipated access pattern. It's
> easy to think of a plausible pattern where a client "rolls through" an
> object via range requests (a very common pattern in practice) and each
> range request spans chunks without ever covering a chunk so you don't
> actually store anything from the entire set of requests, even though ATS
> saw the entire object. The span scheme, OTOH, would end up capturing the
> entire object.
>

This "roll through" use case might indeed be the common pattern.

The chunk validity solution behavior for this case:
In the worst case, as you commented, after reading the whole object we will
not be able to cache anything from the object.
In the average case (when a range request is bigger than the size of the
chunk) the cached object will have many unfilled chunks and we might have
to do a lot of small range requests to fill these gaps.

The span scheme behavior is ideal in both the average and worst case
scenario.

Other use cases:
- "roll through" a part of the object then skip a portion of a file and
then "roll through" again (and so on): the span scheme is again the better
one as this is a generalization of the common pattern.
- make many "disjoint" range requests: the two solutions would probably be
equally good with a plus on the side of the chunk validity bitmap solution
as it might cache more data.

Any other use cases I have missed?

Re: Partial object caching

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Monday, September 10, 2012, 5:14:40 PM, you wrote:

> Can we also consider the other way: make the request to the origin server
> with a larger range so that we may 'join' two disjoint parts of the data?
> (try to avoid having many empty chunks in between filled chunks)

We can consider it but I think it won't work. It would require, among other things, cleverness in changing the outgoing request *and* tweaking the response so that the client gets the originally requested range. Unlikely to be worth it - better to do something in the nature of the other bug you mentioned where, in a plugin, completely synthetic range requests are generated to fill gaps.


> The solution with the chunk validity bitmap has the advantage that it works
> fine without throwing away cached data.
> Does throwing away parts of a cached object while updating the cache for
> the same object raise any synchronization issues?

No, because you don't really throw away the data, you simply change the valid span values in the fragment. Reading and writing those have to be synchronized for other reasons so it's no additional cost.

Which is better may well depend on our anticipated access pattern. It's easy to think of a plausible pattern where a client "rolls through" an object via range requests (a very common pattern in practice) and each range request spans chunks without ever covering a chunk (for example, sequential 768 KB requests against 1 MB chunks can never contain a complete chunk), so you don't actually store anything from the entire set of requests, even though ATS saw the entire object. The span scheme, OTOH, would end up capturing the entire object.


Re: Partial object caching

Posted by Bogdan Graur <bg...@gmail.com>.
> Your write up seems basically accurate to me. I would note again that
> reading the earliest Doc of an alternate is an historical artifact of the
> implementation and not at all a functional requirement.
>

I see. I'm glad you cleared this up too. Initially I thought you planned
to store the fragment table in this earliest Doc. I checked the trunk and
saw that it is in fact stored in the alternate entry itself (the HTTPCacheAlt
structure).

> The other half of the proposal is that, when servicing a range request that
> has been forwarded to the origin server, we cache fill any complete chunks
> that are returned and ignore any partial chunks.

That's an elegant solution for keeping the chunk validity bitmap consistent!
Can we also consider the other way: make the request to the origin server
with a larger range so that we may 'join' two disjoint parts of the data?
(try to avoid having many empty chunks in between filled chunks)


> This is the point of having chunks, to make it much more likely that we
> can capture data from range requests. We could do that with just fragments
> but IMHO the granularity of that is too coarse. It might be interesting as
> a first step to try it to validate the partial write logic.
>
> There is also the issue of how much compile time and/or run time control
> should be provided over the graininess of the chunks. IMHO it would be
> reasonable to support compile time control over a value N such that each
> fragment consists of 2^N chunks.
>
I think this can be done.


> The fragment table changes are now on trunk, committed under the TS-1339
> bug ID.
>
Thanks for pointing that out.


>
> P.S. An alternative scheme would be to store for each fragment a
> valid span (two values, start/end) and then extend that if a range request
> has an overlap. You could even, in the disjoint case, compute which span
> had more data and keep that (e.g., effectively discard cached data if you
> can replace it with a larger span in the fragment).
>

The solution with the chunk validity bitmap has the advantage that it works
fine without throwing away cached data.
Does throwing away parts of a cached object while updating the cache for
the same object raise any synchronization issues?




>
> Sunday, September 9, 2012, 5:37:29 PM, you wrote:
>
> > Thank you very much for the clarifications!
>
> > I think now I have a better view on the current cache structure and your
> > proposal.
>
> > Who should also validate these changes before going to the next step
> which
> > I think would be the strategy of what and when to cache?
>
> > PS: is there a branch already available with your change to move the
> > fragment table from the first Doc to the alternate entries?
>
>

Re: Partial object caching

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Your write up seems basically accurate to me. I would note again that reading the earliest Doc of an alternate is an historical artifact of the implementation and not at all a functional requirement.

The other half of the proposal is that, when servicing a range request that has been forwarded to the origin server, we cache fill any complete chunks that are returned and ignore any partial chunks. This is the point of having chunks, to make it much more likely that we can capture data from range requests. We could do that with just fragments but IMHO the granularity of that is too coarse. It might be interesting as a first step to try it to validate the partial write logic.

There is also the issue of how much compile time and/or run time control should be provided over the graininess of the chunks. IMHO it would be reasonable to support compile time control over a value N such that each fragment consists of 2^N chunks.
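To make the two ideas above concrete, here is a minimal sketch; the chunk count, fragment size, and names are all assumed for illustration rather than taken from ATS. A compile-time N gives 2^N chunks per fragment (a 64-bit bitmap suffices for N <= 6), and only the chunks completely covered by a returned range are marked valid:

#include <cstdint>

// All values below are assumptions for the sketch, not ATS configuration.
constexpr int      CHUNKS_LOG2     = 6;                 // N: 2^6 = 64 chunks per fragment
constexpr int      CHUNKS_PER_FRAG = 1 << CHUNKS_LOG2;  // fits in a uint64_t bitmap
constexpr uint64_t FRAG_SIZE       = 1 << 20;           // assume 1 MiB fragments
constexpr uint64_t CHUNK_SIZE      = FRAG_SIZE / CHUNKS_PER_FRAG;

// Fold the byte range [resp_start, resp_end) returned by the origin
// (offsets relative to one fragment) into that fragment's validity bitmap.
// Only chunks completely covered by the response are marked; partial
// chunks at either edge are ignored.
uint64_t fill_complete_chunks(uint64_t valid_bits, uint64_t resp_start, uint64_t resp_end)
{
  uint64_t first = (resp_start + CHUNK_SIZE - 1) / CHUNK_SIZE; // round start up
  uint64_t last  = resp_end / CHUNK_SIZE;                      // round end down
  for (uint64_t c = first; c < last; ++c) {
    valid_bits |= uint64_t{1} << c;
  }
  return valid_bits;
}

A response that never spans a whole chunk (first >= last) leaves the bitmap untouched, which is exactly the "roll through" worst case discussed elsewhere in the thread.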

The fragment table changes are now on trunk, committed under the TS-1339 bug ID.

P.S. An alternative scheme would be to store for each fragment a valid span (two values, start/end) and then extend that if a range request has an overlap. You could even, in the disjoint case, compute which span had more data and keep that (e.g., effectively discard cached data if you can replace it with a larger span in the fragment).
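For comparison, a hedged sketch of the P.S. idea: one valid span per fragment, extended when a new write overlaps or touches it, and replaced only when a disjoint write would hold more data (names are again made up for illustration):

#include <algorithm>
#include <cstdint>

// Per-fragment valid span, half-open: bytes [start, end) are cached.
struct ValidSpan {
  uint64_t start = 0;
  uint64_t end   = 0; // start == end means nothing cached yet
  uint64_t size() const { return end - start; }
};

// Merge a newly written range [s, e) into the fragment's current span.
ValidSpan update_span(ValidSpan cur, uint64_t s, uint64_t e)
{
  ValidSpan incoming{s, e};
  if (cur.size() == 0) {
    return incoming; // first data for this fragment
  }
  if (s <= cur.end && e >= cur.start) {
    // Overlapping or touching: extend the existing span.
    return ValidSpan{std::min(cur.start, s), std::max(cur.end, e)};
  }
  // Disjoint: keep whichever span covers more bytes, effectively
  // discarding the smaller piece of cached data.
  return incoming.size() > cur.size() ? incoming : cur;
}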

Sunday, September 9, 2012, 5:37:29 PM, you wrote:

> Thank you very much for the clarifications!

> I think now I have a better view on the current cache structure and your
> proposal.

> Who should also validate these changes before going to the next step which
> I think would be the strategy of what and when to cache?

> PS: is there a branch already available with your change to move the
> fragment table from the first Doc to the alternate entries?


Re: Partial object caching

Posted by Bogdan Graur <bg...@gmail.com>.
Thank you very much for the clarifications!

I think now I have a better view on the current cache structure and your
proposal.

Current flow for a range request:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector and the fragment table (of
course the fragment table really belongs to the entries in the alternate
vector, and you said you will make this change soon).
3. Find the best matching alternate entry and use that as a representation
of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate.
5. Compute the needed fragments for serving the requested range.
6. Read the needed fragments and serve the response.

In the current state, if an object is cached, all of its data is available
from the cache.

With your proposal:
1. Find the first Doc for the cached object.
2. The first Doc contains the alternate vector.
3. Find the best matching alternate entry and use that as a representation
of the cached object in the next steps.
4. Read the Doc corresponding to the best matching alternate. [With your
change, I presume the fragment table will already be available here.]
5. Compute the needed fragments and verify that all data is available
(check corresponding chunk validity info that should be located in the
fragment table)
6. Read the needed fragments and serve the response if everything is
available.

Considering that your proposal fully supports partially cached objects
without any performance hit, my opinion is that it would be a good
representation for our needs.
The changes to the cache structure are also kept to a minimum.

Who should also validate these changes before going to the next step which
I think would be the strategy of what and when to cache?

PS: is there a branch already available with your change to move the
fragment table from the first Doc to the alternate entries?

Best regards,
Bogdan Graur


On Thu, Sep 6, 2012 at 4:15 AM, Alan M. Carroll <amc@network-geographics.com> wrote:

> Wednesday, September 5, 2012, 4:16:42 PM, you wrote:
>
> > Are all Doc structures of the same size?
>
> Mostly, but you don't need their actual size, only the size of the content
> per Doc instance, which is implicit in the fragment offset table. Given a
> range request we can walk the fragment offset table (which was read with
> the First/Earliest Doc) until we find the fragment which contains the
> starting byte of the range and begin loading fragments with that one.
>
> > So the fragments table entry will be augmented with a new uint64_t field
> > which holds the chunk validity information.
>
> Right.
>
> > But after computing all the necessary "Doc" structures for satisfying a
> > range request they still have to be read from disk (not the content, just
> > the first part containing the fragments table) to check for chunk
> validity.
>
> No. The fragment offset table will be moved into the alternate header,
> which is stored in First Doc. Once that is read you have all of the
> fragment offset and chunk validity data in memory. The entire table, with
> all offsets and validity bitmaps, is a single chunk of data in the
> alternate header.
>
> Again, this is just a proposal to stimulate discussion, it is *not* a
> committed plan of action.
>
> Only when we know the complete range is valid may we serve the range
> from the cache.
>
> Yes, and that can be computed from in memory data as soon as the alternate
> is selected.
>
>

Re: Partial object caching

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Wednesday, September 5, 2012, 4:16:42 PM, you wrote:

> Are all Doc structures of the same size?

Mostly, but you don't need their actual size, only the size of the content per Doc instance, which is implicit in the fragment offset table. Given a range request we can walk the fragment offset table (which was read with the First/Earliest Doc) until we find the fragment which contains the starting byte of the range and begin loading fragments with that one.
 
> So the fragments table entry will be augmented with a new uint64_t field
> which holds the chunk validity information.

Right.

> But after computing all the necessary "Doc" structures for satisfying a
> range request they still have to be read from disk (not the content, just
> the first part containing the fragments table) to check for chunk validity.

No. The fragment offset table will be moved into the alternate header, which is stored in First Doc. Once that is read you have all of the fragment offset and chunk validity data in memory. The entire table, with all offsets and validity bitmaps, is a single chunk of data in the alternate header.

Again, this is just a proposal to stimulate discussion, it is *not* a committed plan of action.

> Only when we know the complete range is valid may we serve the range
> from the cache.

Yes, and that can be computed from in memory data as soon as the alternate is selected.
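A hedged sketch of that in-memory check, assuming the per-alternate table holds one end offset and one 64-bit chunk validity bitmap per fragment (the struct and chunk size are illustrative, not the actual layout):

#include <algorithm>
#include <cstdint>
#include <vector>

struct FragInfo {
  uint64_t end_offset;  // first byte past this fragment in the object
  uint64_t chunk_valid; // bit i set => chunk i of this fragment is cached
};

constexpr uint64_t CHUNK_SIZE = 16 * 1024; // assumed; at most 64 chunks per fragment

// True if every chunk overlapping [range_start, range_end) is valid, i.e.
// the whole range can be served from cache without going to the origin.
bool range_in_cache(const std::vector<FragInfo> &frags, uint64_t range_start, uint64_t range_end)
{
  uint64_t frag_start = 0;
  for (const FragInfo &f : frags) {
    if (f.end_offset > range_start) { // this fragment overlaps the range
      uint64_t lo = std::max(range_start, frag_start) - frag_start;
      uint64_t hi = std::min(range_end, f.end_offset) - frag_start;
      for (uint64_t c = lo / CHUNK_SIZE; c * CHUNK_SIZE < hi; ++c) {
        if (!(f.chunk_valid & (uint64_t{1} << c))) {
          return false; // a missing chunk somewhere in the range
        }
      }
    }
    frag_start = f.end_offset;
    if (frag_start >= range_end) {
      break; // past the end of the requested range
    }
  }
  return true;
}

No fragment is read from disk to make this decision; only the fragments that actually hold the range are loaded afterwards.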


Re: Partial object caching

Posted by Bogdan Graur <bg...@gmail.com>.
>
>
> No. The keys are computationally linked. Given a key for a fragment, the
> key for the next fragment is computed, not read.


OK, I think I found the way the next_key is computed from a given key.


> Therefore you can compute the key for fragment i from the key for fragment
> 0 without I/O.
>

Are all Doc structures of the same size? If they differ in size and we do
not have each of their sizes somewhere in RAM (or in the first fragment),
how can we get away without any I/O for the irrelevant Docs when serving
a range request?


> Hmmm, that would be a problem - how can you verify the entire range if it
> spans fragments, as there may be missing chunks in the middle? You don't
> want to start serving and then discover a missing chunk. Ah, that's why the
> validity bitmaps would be stored in the fragment table, so all of that is
> available in memory before reading any fragments.
>

So the fragment table entry will be augmented with a new uint64_t field
which holds the chunk validity information.
But after computing all the necessary "Doc" structures for satisfying a
range request, they still have to be read from disk (not the content, just
the first part containing the fragment table) to check for chunk validity.
Only when we know the complete range is valid may we serve the range
from the cache.

Thank you very much for pointing me to the "Traffic Server Programming
Guide"!

Re: Partial object caching

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Wednesday, September 5, 2012, 3:54:03 AM, you wrote:

> Just for clarification: I did some tests and it seems the entries in the
> "alternate vector" stored in the first "Doc" are different versions of the
> same cached object?

Yes, see here:
http://trafficserver.apache.org/docs/trunk/sdk/http-hooks-and-transactions/http-alternate-selection.en.html
 
> As the cached object is represented with a linked list of fragments does
> that mean that in order to serve a specific range from it we have to read
> all fragments (presumably from disk) for the next_key until we get to the
> significant fragment?

No. The keys are computationally linked. Given a key for a fragment, the key for the next fragment is computed, not read. Therefore you can compute the key for fragment i from the key for fragment 0 without I/O.
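A minimal sketch of that property, assuming nothing about ATS's real key derivation beyond the fact that the next key is a pure function of the previous one (the mixing function here is just a stand-in):

#include <cstdint>

struct CacheKey {
  uint64_t hi;
  uint64_t lo;
};

// Stand-in "next key" derivation: deterministic, so fragment i's key can
// be computed from fragment 0's key with no disk reads.
CacheKey next_key(const CacheKey &k)
{
  auto mix = [](uint64_t x) {
    // splitmix64-style scramble, purely illustrative
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  };
  return CacheKey{mix(k.hi), mix(k.lo)};
}

// Key of fragment i, derived purely in memory from the earliest key.
CacheKey fragment_key(CacheKey k0, int i)
{
  for (int n = 0; n < i; ++n) {
    k0 = next_key(k0);
  }
  return k0;
}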

> From this description it seems to me that these "fragments inside a
> fragment" resemble the chunks you describe in your original e-mail.

No, the chunks in my original e-mail would be subranges of these fragments. Current fragments are effectively all the same size except the last fragment. "Chunks" are not currently used for anything as they do not exist in the current implementation. If implemented as suggested they would be more notional, effectively being stored only as the validity bitmap per fragment. Disk I/O would continue to be per fragment. Fragments exist to limit disk I/O per I/O operation. The range acceleration was bolted on by adding the fragment offset tables.

Hmmm, that would be a problem - how can you verify the entire range if it spans fragments, as there may be missing chunks in the middle? You don't want to start serving and then discover a missing chunk. Ah, that's why the validity bitmaps would be stored in the fragment table, so all of that is available in memory before reading any fragments.