Posted to users@trafficserver.apache.org by Rasim Saltuk Alakuş <ra...@turksat.com.tr> on 2014/08/27 18:17:17 UTC

generating hash from packet content

Hi All,

ATS uses a URL hash as the cache key, and the CacheUrl plugin adds some flexibility to the URL hashing strategy.

We are thinking of creating a hash based on the packet content and using it as the key when storing to and retrieving from the cache. This looks like a better solution, because URI changes would not hurt the caching system. One immediate benefit: if you cache YouTube, for example, each request for the same video can have a different URL, and the CacheUrl plugin does not always provide a good solution. Also, maintaining site-based hash filters does not look like an elegant solution.

Is there any previous or active work on implementing content-based hashing? What kinds of problems and constraints do you foresee? Is there any volunteer to implement this feature together with us?

Kind regards
Saltuk Alakuş

Rasim Saltuk Alakuş

Kıdemli Uzman
Senior Specialist
Bilişim Ar-Ge ve Teknoloji Direktörlüğü
IT R & D and Technology

www.turksat.com.tr<http://www.turksat.com.tr>
ralakus@turksat.com.tr<ma...@turksat.com.tr>

TUNA MAH. İSMAİL ÖZKUNT 1709 SOK. NO:3 KAT:2 KARŞIYAKA – İZMİR
T : +90 232 323 43 00
F : +90 232 323 43 44





"Bu mesaj ve ekleri mesajda gönderildiği belirtilen kişi ya da kişilere özel olup gizli bilgiler içeriyor olabilir. Mesajın muhatabı ilgilisi ya da gönderileni değilseniz lütfen mesajı herhangi bir şekilde kullanmayınız çoğaltmayınız ve başkalarına ifşa etmeyiniz. Eğer mesaj yanlışlıkla size ulaşmışsa anılan mesaj ve ekinde yer alan bilgileri gizli tutunuz ve mesajı gönderen kişiyi bilgilendirerek söz konusu mesaj ile eklerini derhal imha ediniz. Bu mesaj ve ekindeki belgelerin bilinen virüslere karşı kontrolü yapılmıştır. Ancak e-posta sistemlerinin taşıdığı risklerden dolayı şirketimiz bu mesajın ve içerdiği bilgilerin size değişikliğe uğrayarak veya geç ulaşmasından bütünlüğünün ve gizliliğinin korunamamasından virüs içermesinden ve herhangi bir sebeple bilgisayarınıza ve sisteminize verebileceği zararlardan sorumlu tutulamaz.”<<<<<

“This message together with its attachments is intended solely for the address(es) and may contain confidential or privileged information. If you are not the intended recipient please do not use copy or disclose the message for any purpose. Should you receive this message by mistake please keep all information contained in the message or its attachments strictly confidential and advise the sender and delete it immediately without retaining a copy. This message and its attachments have been swept by anti-virus systems for the presence of known viruses. However due to the risks of e-mail systems our company cannot accept liability for any changes or delay in receiving loss of integrity and confidentiality containing viruses and any damages caused in any way to your computer and system recipient, you are notified that disclosing, distributing, or copying this e-mail is strictly prohibited. “



RE: generating hash from packet content

Posted by Luca Rea <lu...@contactlab.com>.
OK, what about using a second ATS (a parent of the first one) with a small amount of storage as a buffering system?

Re: generating hash from packet content

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Luca,

Monday, September 1, 2014, 1:49:13 PM, you wrote:

> You can also choose to store it in cache and delete dupe in a second moment.

You could, but once you've written to cache, you have advanced the write cursor. Deleting the duplicate later has no effect except changing a directory entry. To have any benefit, you need to detect it as a duplicate before writing it to disk.


Re: generating hash from packet content

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Tuesday, September 2, 2014, 3:04:55 AM, you wrote:

> How does ATS manage the expired objecs? If an object has expired/removed , I assume allocated space for it can later be used by other objects.

ATS doesn't have allocated space in the cache. It's a circular buffer. All objects, including expired ones, are eventually overwritten. There is no other space management.

> PS: I am very new to ATS, and the link you provided is quite low level. So above is my basic expectation at best from a file system like structure.

Well, there isn't any higher level to the ATS cache. That low-level description is effectively a complete description. There is a write cursor, and ATS writes to the cache at that location, advancing (circularly) as it writes. The directory keeps track of what has been written, but data is always written at the write cursor.
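
As a rough illustration only (a toy Python model of the behaviour described above, not the actual cache code):

    # Toy model of a circular cache span with a write cursor and a directory.
    class ToyCache:
        def __init__(self, size):
            self.data = bytearray(size)   # the pre-allocated storage span
            self.cursor = 0               # the write cursor
            self.directory = {}           # key -> (offset, length)

        def write(self, key, body):
            # Data is always written at the write cursor, wrapping circularly,
            # so older entries are eventually overwritten.
            offset = self.cursor
            for i, b in enumerate(body):
                self.data[(offset + i) % len(self.data)] = b
            self.cursor = (offset + len(body)) % len(self.data)
            self.directory[key] = (offset, len(body))

        def delete(self, key):
            # "Deleting" only drops the directory entry; the bytes stay on
            # disk until the cursor comes around and overwrites them.
            self.directory.pop(key, None)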


RE: generating hash from packet content

Posted by Rasim Saltuk Alakuş <ra...@turksat.com.tr>.
Hi Alan,

How does ATS manage expired objects? If an object has expired or been removed, I assume the space allocated for it can later be used by other objects.

The total size is pre-allocated for ATS, but using the allocated space effectively is still an optimization, so I assume removing the duplicated data would do that.

If my understanding is correct, I would like to implement this: write all cacheable content to disk as it arrives, and have a background, low-priority process check for duplicates and remove them. Do you think this is possible?

PS: I am very new to ATS, and the link you provided is quite low level, so the above is my basic expectation from a file-system-like structure.

Kind regards
Saltuk Alakuş 







-----Original Message-----
From: Alan M. Carroll [mailto:amc@network-geographics.com] 
Sent: Tuesday, September 02, 2014 1:53 AM
To: dev@trafficserver.apache.org
Subject: Re: generating hash from packet content

Luca,

Monday, September 1, 2014, 1:49:13 PM, you wrote:

> You can also choose to store it in cache and delete dupe in a second moment.

You could, but once you've written to cache, you have advanced the write cursor. Deleting the duplicate later has no effect except changing a directory entry. To have any benefit, you need to detect it as a duplicate before writing it to disk.


RE: generating hash from packet content

Posted by Luca Rea <lu...@contactlab.com>.
You can also choose to store it in the cache and delete the dupe at a later time.

Re: generating hash from packet content

Posted by Leif Hedstrom <zw...@apache.org>.
I assume the idea would be to not write to the cache until you have decided it is not a dupe? That would imply buffering it somewhere that does not move the write cursor forward.

-- Leif 

> On Sep 1, 2014, at 9:55 AM, Luca Rea <lu...@contactlab.com> wrote:
> 
> Can it help to keep only the useful cache and limit the recycle process?

RE: generating hash from packet content

Posted by Luca Rea <lu...@contactlab.com>.
Can it help to keep only the useful content in the cache and limit the recycling process?

Re: generating hash from packet content

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Rasim,

Monday, September 1, 2014, 10:22:06 AM, you wrote:

> Looks like it is not feasible/possible to remove URL hash map solution completely.. However, storage optimization was another topic in our mind, which content hash can save the day. We are thinking this can be a nice feature if implemented. If anybody decide to implement please let us know.

How would it optimize storage? ATS uses all of its assigned storage at all times. Because of how the cache works, removing duplicates would have no effect on ATS's use of cache storage. Please see this link for some introductory detail

https://docs.trafficserver.apache.org/en/latest/arch/cache/cache-arch.en.html#storage-layout 


RE: generating hash from packet content

Posted by Rasim Saltuk Alakuş <ra...@turksat.com.tr>.
Hi All,

Thanks for your kind replies. 

It looks like it is not feasible to remove the URL hash mapping completely. However, storage optimization was another topic on our minds, where a content hash could save the day. We think this could be a nice feature if implemented. If anybody decides to implement it, please let us know.


regards
Saltuk







________________________________________
From: Rasim Saltuk Alakuş
Sent: Wednesday, August 27, 2014 7:17 PM
To: dev@trafficserver.apache.org; users@trafficserver.apache.org
Subject: generating hash from packet content

Hi All,

ATS uses a URL hash as the cache key, and the CacheUrl plugin adds some flexibility to the URL hashing strategy.

We are thinking of creating a hash based on the packet content and using it as the key when storing to and retrieving from the cache. This looks like a better solution, because URI changes would not hurt the caching system. One immediate benefit: if you cache YouTube, for example, each request for the same video can have a different URL, and the CacheUrl plugin does not always provide a good solution. Also, maintaining site-based hash filters does not look like an elegant solution.

Is there any previous or active work on implementing content-based hashing? What kinds of problems and constraints do you foresee? Is there any volunteer to implement this feature together with us?

Kind regards
Saltuk Alakuş

Re: generating hash from packet content

Posted by Susan Hinrichs <sh...@network-geographics.com>.
I've been thinking about it recently.  But it hasn't come to the top of 
my priority queue yet.

Some of the content providers are making the mapping of fixed asset ID 
to URL more and more obscure, so ultimately a hash-based solution 
becomes necessary.  You pay some startup cost (having to fetch a 
fixed portion of the data to compute the hash), but as you point out, the 
hash-based solution is more resilient in the face of URL changes and/or 
content changes.

This paper from Alcatel describes the trade-offs nicely 
http://www.nctatechnicalpapers.com/Paper/2014/2014-application-of-policy-based-indexes-and-unified-caching-for-content-delivery/download

From my thinking so far, the hash solution would require partial-object 
caching.  Unless you are concentrating on very small content, you 
don't want to force grabbing the whole asset at once, for performance 
reasons; rather, you want to be fetching at the same rate as the client 
requestor.  So if the client requests data in 10-second increments, 
you'll want to store those increments as they come in rather than 
requesting all 30 minutes (or 2 hours or more) up front.

Alan has a plan for adding partial object caching which has been 
discussed at one of the summits in the past year.
http://network-geographics.com/ats/docs/partial-object-caching.en.html.
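
As a very rough sketch of the "fixed portion" idea (the prefix size and hash 
algorithm here are arbitrary assumptions, not a worked-out design):

    import hashlib

    PREFIX_BYTES = 1_000_000  # assumption: hash only the first 1 MB of the asset

    def content_key(first_chunk: bytes) -> str:
        """Derive a cache key from a fixed-size prefix of the object body."""
        return hashlib.sha256(first_chunk[:PREFIX_BYTES]).hexdigest()

The appeal is that only the prefix has to be fetched before the key is known; 
the rest of the asset can then be pulled (and stored) incrementally.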


On 8/27/2014 11:17 AM, Rasim Saltuk Alakuş wrote:
> Hi All,
>
> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more flexibility in URL hashing strategy.
>
> We think of creating hash based on packet content and use it as the hash while storing and retrieving from cache This looks a better solution, so that URI changes won't hurt caching system. One immediate benefit for example if you cache YouTube , each request for same video can have different URL and CacheUrl plugin does not always provide a good solution. Also maintaining site based hash filters looks not an elegant solution.
>
> Is there any previous or active work for implementing content based hashing? What kind of problems and constrains you may guess. Is there any volunteer to implement this feature together with us?
>
> Kind regards
> Saltuk Alakuş
>
> Rasim Saltuk Alakuş
>
> Kıdemli Uzman
> Senior Specialist
> Bilişim Ar-Ge ve Teknoloji Direktörlüğü
> IT R & D and Technology
>
> www.turksat.com.tr<http://www.turksat.com.tr>
> ralakus@turksat.com.tr<ma...@turksat.com.tr>
>
> TUNA MAH. İSMAİL ÖZKUNT 1709 SOK. NO:3 KAT:2 KARŞIYAKA – İZMİR
> T : +90 232 323 43 00
> F : +90 232 323 43 44
>
>
>
>
>
> "Bu mesaj ve ekleri mesajda gönderildiği belirtilen kişi ya da kişilere özel olup gizli bilgiler içeriyor olabilir. Mesajın muhatabı ilgilisi ya da gönderileni değilseniz lütfen mesajı herhangi bir şekilde kullanmayınız çoğaltmayınız ve başkalarına ifşa etmeyiniz. Eğer mesaj yanlışlıkla size ulaşmışsa anılan mesaj ve ekinde yer alan bilgileri gizli tutunuz ve mesajı gönderen kişiyi bilgilendirerek söz konusu mesaj ile eklerini derhal imha ediniz. Bu mesaj ve ekindeki belgelerin bilinen virüslere karşı kontrolü yapılmıştır. Ancak e-posta sistemlerinin taşıdığı risklerden dolayı şirketimiz bu mesajın ve içerdiği bilgilerin size değişikliğe uğrayarak veya geç ulaşmasından bütünlüğünün ve gizliliğinin korunamamasından virüs içermesinden ve herhangi bir sebeple bilgisayarınıza ve sisteminize verebileceği zararlardan sorumlu tutulamaz.”<<<<<
>
> “This message together with its attachments is intended solely for the address(es) and may contain confidential or privileged information. If you are not the intended recipient please do not use copy or disclose the message for any purpose. Should you receive this message by mistake please keep all information contained in the message or its attachments strictly confidential and advise the sender and delete it immediately without retaining a copy. This message and its attachments have been swept by anti-virus systems for the presence of known viruses. However due to the risks of e-mail systems our company cannot accept liability for any changes or delay in receiving loss of integrity and confidentiality containing viruses and any damages caused in any way to your computer and system recipient, you are notified that disclosing, distributing, or copying this e-mail is strictly prohibited. “
>
>



Re: generating hash from packet content

Posted by Bill Zeng <bi...@gmail.com>.
Just as a side question, do we have statistics on the extent of duplication
in the ATS cache? Say, how many URLs point to the same object on
average? It seems like a trade-off between duplication and computation
(space and time).



On Wed, Aug 27, 2014 at 1:22 PM, Leif Hedstrom <zw...@apache.org> wrote:

> On Aug 27, 2014, at 1:51 PM, Nick Kew <ni...@apache.org> wrote:
>
> > On Wed, 27 Aug 2014 16:17:17 +0000
> > Rasim Saltuk Alakuş <ra...@turksat.com.tr> wrote:
> >
> >>
> >> Hi All,
> >>
> >> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more
> flexibility in URL hashing strategy.
> >>
> >> We think of creating hash based on packet content and use it as the
> hash while storing and retrieving from cache This looks a better solution,
> so that URI changes won't hurt caching system. One immediate benefit for
> example if you cache YouTube , each request for same video can have
> different URL and CacheUrl plugin does not always provide a good solution.
> Also maintaining site based hash filters looks not an elegant solution.
> >>
> >> Is there any previous or active work for implementing content based
> hashing? What kind of problems and constrains you may guess. Is there any
> volunteer to implement this feature together with us?
> >
> >
> > Indeed, the whole scheme is BAD (Broken As Designed).
> > Using different URLs for common content breaks cacheing on
> > the Web at large, and hacking one agent (such as Trafficserver)
> > to work around it will gain you only a tiny fraction of what
> > you've thrown away.  Indeed, if every agent on the Web -
> > from origin servers to desktop browsers - implemented this
> > cacheing scheme, you'd still lose MOST of the benefits of
> > cacheing, as the same content passes through different paths.
>
>
>
> I thought some more on this over a boring meeting, two more thoughts comes
> to mind:
>
> 1) Cache poisoning. This could be a serious problem, at a minimum some
> defenses such as using the Host: portion of the request for the cache key
> would be required. But, I’m guessing that still would be possible to abuse,
> to poison the HTTP caches (since the client request + origin response
> headers no longer dictates the cache lookup).
>
> 2) HTTP/2. Albeit it supports non-TLS, several browser vendors have
> indicated they will not support H2 over plain text. So, assuming we’re
> moving towards TLS across the board, this sort of interaction will get more
> tricky. I personally think it’ll have to evolve in a way that the content
> owners will need to participate better with caches. It’s too early to say,
> but maybe such a proposal would encourage the YouTube’s and Netflix’es to
> behave better (in some way that they can still control content, ad
> impressions, click tracking etc. etc. yet allow ISPs to cache the actual
> content).
>
> Just my $0.01,
>
>
> — Leif
>
>

Re: generating hash from packet content

Posted by Susan Hinrichs <sh...@network-geographics.com>.
On 8/27/2014 3:22 PM, Leif Hedstrom wrote:
> On Aug 27, 2014, at 1:51 PM, Nick Kew <ni...@apache.org> wrote:
>
>> On Wed, 27 Aug 2014 16:17:17 +0000
>> Rasim Saltuk Alakuş <ra...@turksat.com.tr> wrote:
>>
>>> Hi All,
>>>
>>> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more flexibility in URL hashing strategy.
>>>
>>> We think of creating hash based on packet content and use it as the hash while storing and retrieving from cache This looks a better solution, so that URI changes won't hurt caching system. One immediate benefit for example if you cache YouTube , each request for same video can have different URL and CacheUrl plugin does not always provide a good solution. Also maintaining site based hash filters looks not an elegant solution.
>>>
>>> Is there any previous or active work for implementing content based hashing? What kind of problems and constrains you may guess. Is there any volunteer to implement this feature together with us?
>>
>> Indeed, the whole scheme is BAD (Broken As Designed).
>> Using different URLs for common content breaks cacheing on
>> the Web at large, and hacking one agent (such as Trafficserver)
>> to work around it will gain you only a tiny fraction of what
>> you've thrown away.  Indeed, if every agent on the Web -
>> from origin servers to desktop browsers - implemented this
>> cacheing scheme, you'd still lose MOST of the benefits of
>> cacheing, as the same content passes through different paths.
>
>
> I thought some more on this over a boring meeting, two more thoughts comes to mind:
>
> 1) Cache poisoning. This could be a serious problem, at a minimum some defenses such as using the Host: portion of the request for the cache key would be required. But, I’m guessing that still would be possible to abuse, to poison the HTTP caches (since the client request + origin response headers no longer dictates the cache lookup).

Good point on the cache poisoning.  If the attacker knew your hash 
generation strategy (e.g. hash the first 1000 bytes of the file) and had 
access to a legitimate copy of that data, he could indeed inject bogus 
data for the non-hashed portion.

Given the large number of potential hosts for a CDN, I think you want to 
generalize the host name before you add it to the lookup key.  If the 
host name matches your expectations for a CDN, you can use a fixed name 
as part of the key.  Otherwise, you use the host name straight.
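
For illustration, that key construction could look roughly like this (the CDN 
host pattern, fixed name and key format are made-up examples, not a real design):

    import hashlib
    import re

    # Hypothetical pattern for a CDN's rotating hostnames, e.g. r4---sn-abc.example-cdn.net
    CDN_HOST_PATTERN = re.compile(r"^[a-z0-9.-]+\.example-cdn\.net$")

    def lookup_key(host: str, content_hash: str) -> str:
        # Generalize the host if it matches the expected CDN pattern, otherwise
        # use it straight, so unrelated origins cannot poison each other's entries.
        generalized = "example-cdn.net" if CDN_HOST_PATTERN.match(host) else host
        return hashlib.sha256(f"{generalized}/{content_hash}".encode()).hexdigest()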


Re: generating hash from packet content

Posted by Leif Hedstrom <zw...@apache.org>.
On Aug 27, 2014, at 1:51 PM, Nick Kew <ni...@apache.org> wrote:

> On Wed, 27 Aug 2014 16:17:17 +0000
> Rasim Saltuk Alakuş <ra...@turksat.com.tr> wrote:
> 
>> 
>> Hi All,
>> 
>> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more flexibility in URL hashing strategy.
>> 
>> We think of creating hash based on packet content and use it as the hash while storing and retrieving from cache This looks a better solution, so that URI changes won't hurt caching system. One immediate benefit for example if you cache YouTube , each request for same video can have different URL and CacheUrl plugin does not always provide a good solution. Also maintaining site based hash filters looks not an elegant solution.
>> 
>> Is there any previous or active work for implementing content based hashing? What kind of problems and constrains you may guess. Is there any volunteer to implement this feature together with us?
> 
> 
> Indeed, the whole scheme is BAD (Broken As Designed).
> Using different URLs for common content breaks cacheing on
> the Web at large, and hacking one agent (such as Trafficserver)
> to work around it will gain you only a tiny fraction of what
> you've thrown away.  Indeed, if every agent on the Web -
> from origin servers to desktop browsers - implemented this
> cacheing scheme, you'd still lose MOST of the benefits of
> cacheing, as the same content passes through different paths.



I thought some more on this over a boring meeting, and two more thoughts come to mind:

1) Cache poisoning. This could be a serious problem; at a minimum some defenses such as using the Host: portion of the request for the cache key would be required. But I'm guessing it would still be possible to abuse this to poison the HTTP caches (since the client request + origin response headers no longer dictate the cache lookup).

2) HTTP/2. Although it supports non-TLS, several browser vendors have indicated they will not support H2 over plain text. So, assuming we're moving towards TLS across the board, this sort of interaction will get more tricky. I personally think it'll have to evolve in a way where the content owners participate better with caches. It's too early to say, but maybe such a proposal would encourage the YouTubes and Netflixes to behave better (in some way that they can still control content, ad impressions, click tracking etc., yet allow ISPs to cache the actual content).

Just my $0.01,


— Leif


Re: generating hash from packet content

Posted by Nick Kew <ni...@apache.org>.
On Wed, 27 Aug 2014 16:17:17 +0000
Rasim Saltuk Alakuş <ra...@turksat.com.tr> wrote:

> 
> Hi All,
> 
> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more flexibility in URL hashing strategy.
> 
> We think of creating hash based on packet content and use it as the hash while storing and retrieving from cache This looks a better solution, so that URI changes won't hurt caching system. One immediate benefit for example if you cache YouTube , each request for same video can have different URL and CacheUrl plugin does not always provide a good solution. Also maintaining site based hash filters looks not an elegant solution.
> 
> Is there any previous or active work for implementing content based hashing? What kind of problems and constrains you may guess. Is there any volunteer to implement this feature together with us?

It would be straightforward enough to implement, though
I think rather expensive in computation.  But what does
it gain you?  A possible many-to-one URL to local cache map,
but you still have to deal per-URL with all the complexities
like content negotiation and cache validation.

Indeed, the whole scheme is BAD (Broken As Designed).
Using different URLs for common content breaks cacheing on
the Web at large, and hacking one agent (such as Trafficserver)
to work around it will gain you only a tiny fraction of what
you've thrown away.  Indeed, if every agent on the Web -
from origin servers to desktop browsers - implemented this
cacheing scheme, you'd still lose MOST of the benefits of
cacheing, as the same content passes through different paths.


-- 
Nick Kew

RE: generating hash from packet content

Posted by Luca Rea <lu...@contactlab.com>.
Hmm... could something like the following help?

Client(Request=Normal URL) -> ATS(Lua) -> NoSQL (PUT: key=hash,value=url object) -> Origin
Client <- ATS(Cache) <- Origin

Client(Request=HASH Cache)  -> ATS(Lua) -> NoSQL (GET: url object) -> ATS(Cache)
Client <- ATS(Cache) 


You can use an in-memory or persistent NoSQL store and set an expiry timeout on the stored records.

That's just an idea.
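
In rough pseudo-code, the store/lookup half of that flow could look like this 
(a sketch only, with a plain dict standing in for the NoSQL store and an 
assumed TTL):

    import time

    # In-memory stand-in for the NoSQL store: content hash -> (URL, expiry time)
    hash_to_url = {}
    TTL_SECONDS = 3600  # assumption: expire mappings after an hour

    def record_object(content_hash, url):
        # Called after fetching from origin: remember which cached URL holds this content.
        hash_to_url[content_hash] = (url, time.time() + TTL_SECONDS)

    def lookup_object(content_hash):
        # Called on a hash-based request: return the cached URL if the mapping is still valid.
        entry = hash_to_url.get(content_hash)
        if entry is None or entry[1] < time.time():
            hash_to_url.pop(content_hash, None)
            return None  # unknown or expired: go to the origin
        return entry[0]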

Re: generating hash from packet content

Posted by "Alan M. Carroll" <am...@network-geographics.com>.
Well, it would definitely be possible to store an indirection object to implement Bill's idea. The URL is used to do a lookup and the object that is returned is a forwarding header, which then causes another lookup. Basically it's a form of remap for the cache, using the cache itself to store the remap table data.

This is the kind of thing that would be relatively easy to implement with the Cache Toolkit. I must work on that again someday...
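
Roughly, the double lookup would behave like this (a toy sketch of the 
forwarding idea, not the Cache Toolkit API; all keys and values are made up):

    # The cache itself stores the "remap" data: some entries are forwarding
    # headers that point at a content-keyed entry holding the actual object.
    cache = {
        "url:/videos/abc?sig=1": {"forward_to": "content:sha256:deadbeef"},
        "url:/videos/abc?sig=2": {"forward_to": "content:sha256:deadbeef"},
        "content:sha256:deadbeef": {"body": b"...the actual object..."},
    }

    def cache_lookup(url_key):
        entry = cache.get(url_key)
        if entry and "forward_to" in entry:
            # The first lookup returned a forwarding header; follow it with a second lookup.
            entry = cache.get(entry["forward_to"])
        return entry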

 >> Just to throw another idea your way. We can insert another level of indirection between URL's and objects. Every object has a unique hash. URL's point to the hashes instead of objects. The hashes are used to look up objects. Even if multiple URL's are duplicated and hence their hashes, they always point to the same object. It seems a non-easy project though. It requires major changes to ATS.
>   
>  
>  I’m not sure I understand this, or how it helps this problem? However, isn’t this sort of how the cache already works? There’s a hash from URL to the “header” entry, which then has its own hash to the actual object. Alan?
>  



> Maybe I did not understand it correctly. Currently, ATS calculates a hash from a URL and uses the hash to look up the actual object. That is "URL --> actual object". My idea is to "URL --> hash of an object --> actual object". We calculate the hash of a URL and use that to look up the hash of an actual object and then use the hash of the actual object to look up the actual object. 


>  
>  — leif
>  
>  




RE: generating hash from packet content

Posted by Luca Rea <lu...@contactlab.com>.
What about a post-optimization of the cache? I mean:

1. When ATS receives a large object, it stores the URL with a rounded timestamp and a "checked: true/false" flag in an RDBMS (e.g. PostgreSQL), with a unique constraint on the URL and timestamp fields.
2. A batch process periodically gets URLs (last_check_time < timestamp, checked = false) from the DB, requests them from ATS (which has them cached), calculates a SHA digest, and then performs two operations against a NoSQL store: insert "key: URL, value: SHA" into table "A" (always), and insert "key: SHA, value: URL" into table "B" (if it does not exist; otherwise update the expiry timeout for that key and delete the ATS cache entry for the new URL). Finally it sets checked = true.
3. When ATS receives a request from a client (not the batch process), it looks for a record in table "A" of the NoSQL store; if a value is returned, it looks up the URL from table "B" and returns the cached data, otherwise it forwards the request to the origin.

Obviously you should evaluate whether something like that is worthwhile. Do you really have that much traffic/cache?
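
A rough sketch of the batch step (step 2), with plain dicts standing in for the 
RDBMS/NoSQL tables and hypothetical fetch/purge callbacks into ATS:

    import hashlib

    table_a = {}  # URL -> SHA (always updated)
    table_b = {}  # SHA -> canonical URL (first writer wins)

    def dedup_pass(unchecked_urls, fetch_cached_body, purge_from_cache):
        """fetch_cached_body and purge_from_cache are hypothetical callbacks into ATS."""
        for url in unchecked_urls:
            body = fetch_cached_body(url)             # request the object back from the cache
            digest = hashlib.sha256(body).hexdigest()
            table_a[url] = digest
            if digest in table_b and table_b[digest] != url:
                purge_from_cache(url)                 # duplicate content: drop the newer cache entry
            else:
                table_b[digest] = url                 # first time this content has been seen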

Re: generating hash from packet content

Posted by Yongming Zhao <mi...@gmail.com>.
I agree with the problem Leif points out here. We may call this a de-duplication solution, but since the hash can only be computed after all the data has arrived from the origin, the duplicate has already been saved by then and the disk space has already been wasted on that duplicated file.

A better solution would be for the origin to send out the content with the common headers plus a SHA and/or MD5 hash string; then we can look up that key in our storage, and it should work as expected.
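
For example (assuming the origin cooperates by sending a digest header; the 
header names below are only an illustration, not something ATS looks for today):

    def key_from_response_headers(headers):
        # If the origin advertises a digest of the body, we can use it as the
        # cache key without having to download any of the content first.
        # "content-md5" is a real (if deprecated) HTTP header; "x-content-sha256"
        # is a hypothetical custom header for this sketch.
        for name in ("x-content-sha256", "content-md5"):
            if name in headers:
                return name + ":" + headers[name]
        return None  # no digest header: fall back to the URL-based key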




On Aug 29, 2014, at 4:09 AM, Leif Hedstrom <zw...@apache.org> wrote:

> 
> On Aug 28, 2014, at 12:19 PM, Bill Zeng <bi...@gmail.com> wrote:
> 
>> 
>> 
>> 
>> On Thu, Aug 28, 2014 at 10:41 AM, Leif Hedstrom <zw...@apache.org> wrote:
>> 
>> On Aug 28, 2014, at 11:35 AM, Bill Zeng <bi...@gmail.com> wrote:
>> 
>>> Just to throw another idea your way. We can insert another level of indirection between URL's and objects. Every object has a unique hash. URL's point to the hashes instead of objects. The hashes are used to look up objects. Even if multiple URL's are duplicated and hence their hashes, they always point to the same object. It seems a non-easy project though. It requires major changes to ATS.
>> 
>> 
>> I’m not sure I understand this, or how it helps this problem? However, isn’t this sort of how the cache already works? There’s a hash from URL to the “header” entry, which then has its own hash to the actual object. Alan?
>> 
>> Maybe I did not understand it correctly. Currently, ATS calculates a hash from a URL and uses the hash to look up the actual object. That is "URL --> actual object". My idea is to "URL --> hash of an object --> actual object". We calculate the hash of a URL and use that to look up the hash of an actual object and then use the hash of the actual object to look up the actual object.
> 
> 
> But what problem does that solve? You have URL <A> and <B>, both which  point to the same object. How do you find that object based only on the client request (URL + headers)? How do you generate the “object hash” for the lookup, without going to origin first? That’s the problem here, afaik?
> 
> Or is your suggestion here to solve the cache deduping problem (which is not what the OP asked for)? If so, there was the beginning for that in the cache code, storing the hash of objects in the cache as well (but maybe that’s gone now?). There is also a CRC (checksum) feature in the cache, maybe the intention back then was to generalizing the cache dedup with these checksums. Only John Plevyak would know :).
> 
> Fwiw, this problem is what Metalink is intended to solve for some use cases (e.g. site mirrors), but Metalink requires cooperation (additional Metalink headers) from the origin. It does not solve (or intend to solve) the issue where e.g. YouTube rotates the content URLs frequently.
> 
> — Leif

- Yongming Zhao 赵永明


Re: generating hash from packet content

Posted by Leif Hedstrom <zw...@apache.org>.
On Aug 28, 2014, at 12:19 PM, Bill Zeng <bi...@gmail.com> wrote:

> 
> 
> 
> On Thu, Aug 28, 2014 at 10:41 AM, Leif Hedstrom <zw...@apache.org> wrote:
> 
> On Aug 28, 2014, at 11:35 AM, Bill Zeng <bi...@gmail.com> wrote:
> 
> > Just to throw another idea your way. We can insert another level of indirection between URL's and objects. Every object has a unique hash. URL's point to the hashes instead of objects. The hashes are used to look up objects. Even if multiple URL's are duplicated and hence their hashes, they always point to the same object. It seems a non-easy project though. It requires major changes to ATS.
> 
> 
> I’m not sure I understand this, or how it helps this problem? However, isn’t this sort of how the cache already works? There’s a hash from URL to the “header” entry, which then has its own hash to the actual object. Alan?
> 
> Maybe I did not understand it correctly. Currently, ATS calculates a hash from a URL and uses the hash to look up the actual object. That is "URL --> actual object". My idea is to "URL --> hash of an object --> actual object". We calculate the hash of a URL and use that to look up the hash of an actual object and then use the hash of the actual object to look up the actual object.


But what problem does that solve? You have URLs <A> and <B>, both of which point to the same object. How do you find that object based only on the client request (URL + headers)? How do you generate the "object hash" for the lookup without going to the origin first? That's the problem here, AFAIK.

Or is your suggestion here meant to solve the cache de-duplication problem (which is not what the OP asked for)? If so, there was the beginning of that in the cache code, storing the hash of objects in the cache as well (but maybe that's gone now?). There is also a CRC (checksum) feature in the cache; maybe the intention back then was to generalize the cache dedup with these checksums. Only John Plevyak would know :).

Fwiw, this problem is what Metalink is intended to solve for some use cases (e.g. site mirrors), but Metalink requires cooperation (additional Metalink headers) from the origin. It does not solve (or intend to solve) the issue where e.g. YouTube rotates the content URLs frequently.

— Leif


Re: generating hash from packet content

Posted by Bill Zeng <bi...@gmail.com>.
On Thu, Aug 28, 2014 at 10:41 AM, Leif Hedstrom <zw...@apache.org> wrote:

>
> On Aug 28, 2014, at 11:35 AM, Bill Zeng <bi...@gmail.com> wrote:
>
> > Just to throw another idea your way. We can insert another level of
> indirection between URL's and objects. Every object has a unique hash.
> URL's point to the hashes instead of objects. The hashes are used to look
> up objects. Even if multiple URL's are duplicated and hence their hashes,
> they always point to the same object. It seems a non-easy project though.
> It requires major changes to ATS.
>
>
> I’m not sure I understand this, or how it helps this problem? However,
> isn’t this sort of how the cache already works? There’s a hash from URL to
> the “header” entry, which then has its own hash to the actual object. Alan?
>

Maybe I did not understand it correctly. Currently, ATS calculates a hash
from a URL and uses the hash to look up the actual object. That is "URL -->
actual object". My idea is to "URL --> hash of an object --> actual
object". We calculate the hash of a URL and use that to look up the hash of
an actual object and then use the hash of the actual object to look up the
actual object.


> — leif
>
>

Re: generating hash from packet content

Posted by Leif Hedstrom <zw...@apache.org>.
On Aug 28, 2014, at 11:35 AM, Bill Zeng <bi...@gmail.com> wrote:

> Just to throw another idea your way. We can insert another level of indirection between URL's and objects. Every object has a unique hash. URL's point to the hashes instead of objects. The hashes are used to look up objects. Even if multiple URL's are duplicated and hence their hashes, they always point to the same object. It seems a non-easy project though. It requires major changes to ATS.


I’m not sure I understand this, or how it helps this problem? However, isn’t this sort of how the cache already works? There’s a hash from URL to the “header” entry, which then has its own hash to the actual object. Alan?

— leif


Re: generating hash from packet content

Posted by Bill Zeng <bi...@gmail.com>.
Just to throw another idea your way. We can insert another level of
indirection between URL's and objects. Every object has a unique hash.
URL's point to the hashes instead of objects. The hashes are used to look
up objects. Even if multiple URL's are duplicated and hence their hashes,
they always point to the same object. It seems a non-easy project though.
It requires major changes to ATS.

Bin



On Thu, Aug 28, 2014 at 12:50 AM, Niki Gorchilov <ni...@gorchilov.com> wrote:

> Hi, Rasim,
>
> AFAICT metalink plugin has a code to calculate checksum of the object
> contents. Still I don't understand how this is going to resolve the problem
> you're trying to address.
>
> In order to have the hash, you need to download the whole object from
> origin server, thus you learn if you have it already cached post factum.
> Then how do you save bandwidth? In such mode, you can only save storage
> space by de-duplication of objects (storing them once instead of multiple
> times).
>
> Second issue I can foresee is RAM exhaust as you need to buffer the whole
> object in memory before making store decision. Especially in terms of
> youtube videos you're planing to process.
>
> HTH,
> Niki
>
>
>
>
> 2014-08-27 19:17 GMT+03:00 Rasim Saltuk Alakuş <ra...@turksat.com.tr>:
>
>
>>
>> Hi All,
>>
>>
>>
>> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more
>> flexibility in URL hashing strategy.
>>
>>
>>
>> We think of creating hash based on packet content and use it as the hash
>> while storing and retrieving from cache This looks a better solution, so
>> that URI changes won't hurt caching system. One immediate benefit for
>> example if you cache YouTube , each request for same video can have
>> different URL and CacheUrl plugin does not always provide a good solution.
>> Also maintaining site based hash filters looks not an elegant solution.
>>
>>
>>
>> Is there any previous or active work for implementing content based
>> hashing? What kind of problems and constrains you may guess. Is there any
>> volunteer to implement this feature together with us?
>>
>>
>>
>> Kind regards
>>
>> Saltuk Alakuş
>>
>>
>> *Rasim Saltuk Alakuş *
>> Kıdemli Uzman
>>  Senior Specialist
>>  Bilişim Ar-Ge ve Teknoloji Direktörlüğü
>>  IT R & D and Technology
>>
>>
>> *www.turksat.com.tr <http://www.turksat.com.tr> *ralakus@turksat.com.tr
>>
>> TUNA MAH. İSMAİL ÖZKUNT 1709 SOK. NO:3 KAT:2 KARŞIYAKA – İZMİR
>> T : +90 232 323 43 00
>> F : +90 232 323 43 44
>>
>>
>>
>>
>>  "Bu mesaj ve ekleri mesajda gönderildiği belirtilen kişi ya da kişilere
>> özel olup gizli bilgiler içeriyor olabilir. Mesajın muhatabı ilgilisi ya da
>> gönderileni değilseniz lütfen mesajı herhangi bir şekilde kullanmayınız
>> çoğaltmayınız ve başkalarına ifşa etmeyiniz. Eğer mesaj yanlışlıkla size
>> ulaşmışsa anılan mesaj ve ekinde yer alan bilgileri gizli tutunuz ve mesajı
>> gönderen kişiyi bilgilendirerek söz konusu mesaj ile eklerini derhal imha
>> ediniz. Bu mesaj ve ekindeki belgelerin bilinen virüslere karşı kontrolü
>> yapılmıştır. Ancak e-posta sistemlerinin taşıdığı risklerden dolayı
>> şirketimiz bu mesajın ve içerdiği bilgilerin size değişikliğe uğrayarak
>> veya geç ulaşmasından bütünlüğünün ve gizliliğinin korunamamasından virüs
>> içermesinden ve herhangi bir sebeple bilgisayarınıza ve sisteminize
>> verebileceği zararlardan sorumlu tutulamaz.”<<<<<
>>
>> “This message together with its attachments is intended solely for the
>> address(es) and may contain confidential or privileged information. If you
>> are not the intended recipient please do not use copy or disclose the
>> message for any purpose. Should you receive this message by mistake please
>> keep all information contained in the message or its attachments strictly
>> confidential and advise the sender and delete it immediately without
>> retaining a copy. This message and its attachments have been swept by
>> anti-virus systems for the presence of known viruses. However due to the
>> risks of e-mail systems our company cannot accept liability for any changes
>> or delay in receiving loss of integrity and confidentiality containing
>> viruses and any damages caused in any way to your computer and system
>> recipient, you are notified that disclosing, distributing, or copying this
>> e-mail is strictly prohibited. “
>>
>>
>>
>

Re: generating hash from packet content

Posted by Niki Gorchilov <ni...@gorchilov.com>.
Hi, Rasim,

AFAICT the metalink plugin has code to calculate a checksum of the object
contents. Still, I don't understand how this is going to resolve the problem
you're trying to address.

In order to have the hash, you need to download the whole object from the
origin server, so you only learn whether you already have it cached after
the fact. Then how do you save bandwidth? In that mode, you can only save
storage space by de-duplication of objects (storing them once instead of
multiple times).

A second issue I can foresee is RAM exhaustion, as you need to buffer the
whole object in memory before making the store decision, especially for the
YouTube videos you're planning to process.
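
(The hash itself can at least be computed incrementally as the data streams
through, so only the digest state has to stay in RAM for the hashing part, but
the point stands: the result is only known after the whole object has been
fetched. A small sketch:)

    import hashlib

    def hash_stream(chunks):
        """Incrementally hash an object as it streams from the origin.
        Only the digest state is kept in memory, yet the final hash is
        not known until the last chunk has been read."""
        h = hashlib.sha256()
        for chunk in chunks:
            h.update(chunk)
        return h.hexdigest()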

HTH,
Niki




2014-08-27 19:17 GMT+03:00 Rasim Saltuk Alakuş <ra...@turksat.com.tr>:

>
>
> Hi All,
>
>
>
> ATS uses URL hash for cache storage. And CacheUrl plugin adds some more
> flexibility in URL hashing strategy.
>
>
>
> We think of creating hash based on packet content and use it as the hash
> while storing and retrieving from cache This looks a better solution, so
> that URI changes won't hurt caching system. One immediate benefit for
> example if you cache YouTube , each request for same video can have
> different URL and CacheUrl plugin does not always provide a good solution.
> Also maintaining site based hash filters looks not an elegant solution.
>
>
>
> Is there any previous or active work for implementing content based
> hashing? What kind of problems and constrains you may guess. Is there any
> volunteer to implement this feature together with us?
>
>
>
> Kind regards
>
> Saltuk Alakuş
>
>
> *Rasim Saltuk Alakuş *
> Kıdemli Uzman
>  Senior Specialist
>  Bilişim Ar-Ge ve Teknoloji Direktörlüğü
>  IT R & D and Technology
>
>
> *www.turksat.com.tr <http://www.turksat.com.tr> *ralakus@turksat.com.tr
>
> TUNA MAH. İSMAİL ÖZKUNT 1709 SOK. NO:3 KAT:2 KARŞIYAKA – İZMİR
> T : +90 232 323 43 00
> F : +90 232 323 43 44
>
>
>
>
>  "Bu mesaj ve ekleri mesajda gönderildiği belirtilen kişi ya da kişilere
> özel olup gizli bilgiler içeriyor olabilir. Mesajın muhatabı ilgilisi ya da
> gönderileni değilseniz lütfen mesajı herhangi bir şekilde kullanmayınız
> çoğaltmayınız ve başkalarına ifşa etmeyiniz. Eğer mesaj yanlışlıkla size
> ulaşmışsa anılan mesaj ve ekinde yer alan bilgileri gizli tutunuz ve mesajı
> gönderen kişiyi bilgilendirerek söz konusu mesaj ile eklerini derhal imha
> ediniz. Bu mesaj ve ekindeki belgelerin bilinen virüslere karşı kontrolü
> yapılmıştır. Ancak e-posta sistemlerinin taşıdığı risklerden dolayı
> şirketimiz bu mesajın ve içerdiği bilgilerin size değişikliğe uğrayarak
> veya geç ulaşmasından bütünlüğünün ve gizliliğinin korunamamasından virüs
> içermesinden ve herhangi bir sebeple bilgisayarınıza ve sisteminize
> verebileceği zararlardan sorumlu tutulamaz.”<<<<<
>
> “This message together with its attachments is intended solely for the
> address(es) and may contain confidential or privileged information. If you
> are not the intended recipient please do not use copy or disclose the
> message for any purpose. Should you receive this message by mistake please
> keep all information contained in the message or its attachments strictly
> confidential and advise the sender and delete it immediately without
> retaining a copy. This message and its attachments have been swept by
> anti-virus systems for the presence of known viruses. However due to the
> risks of e-mail systems our company cannot accept liability for any changes
> or delay in receiving loss of integrity and confidentiality containing
> viruses and any damages caused in any way to your computer and system
> recipient, you are notified that disclosing, distributing, or copying this
> e-mail is strictly prohibited. “
>
>
>
