Posted to user@thrift.apache.org by Abhay M <it...@gmail.com> on 2010/06/11 17:26:20 UTC

Serializing large data sets

Hi,

Are there any known concerns with serializing large data sets with Thrift? I
am looking to serialize messages with 10-150K records, sometimes resulting
in ~30 MB per message. These messages are serialized for storage.

I have been experimenting with Google protobuf and saw this in the
documentation (
http://code.google.com/apis/protocolbuffers/docs/techniques.html) -
"Protocol Buffers are not designed to handle large messages. As a general
rule of thumb, if you are dealing in messages larger than a megabyte each,
it may be time to consider an alternate strategy."
FWIW, I did switch to the delimited write/parse API (Java only) as
recommended in the doc and it works well. But the Python protobuf
implementation lacks this API and is slow.
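
For reference, a minimal Java sketch of that delimited protobuf API, assuming
a generated message class named Rec (the class name and file path are made up
for illustration):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

public class DelimitedExample {
    // Write each record with a varint length prefix so one stream can hold many messages.
    static void writeAll(List<Rec> records, String path) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            for (Rec rec : records) {
                rec.writeDelimitedTo(out);
            }
        }
    }

    // Read records back one at a time; parseDelimitedFrom returns null at end of
    // stream, so the whole data set never has to sit in memory at once.
    static void readAll(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            Rec rec;
            while ((rec = Rec.parseDelimitedFrom(in)) != null) {
                // process one record here
            }
        }
    }
}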

Thanks
Abhay

Re: Serializing large data sets

Posted by Abhay M <it...@gmail.com>.
Thanks! This is helpful.

I'll try to get a sense of how much RAM it'll take to deserialize messages of
this type -

struct TRecordList {
  1: list<TRec> records,
}

Assuming the message is first parsed into TRec beans (because it is defined as
a list of beans), which are in turn converted into application beans, I am
guessing it will take approximately 3 times the size of the serialized message
(probably more).
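
In case it's useful, a rough Java sketch of that measurement using Thrift's
TDeserializer helper; TRecordList is the generated class for the struct above,
the input file path is taken from the command line, and the heap numbers from
Runtime are only an approximation:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class DeserializeSizeCheck {
    public static void main(String[] args) throws Exception {
        // args[0]: a file holding one TRecordList serialized with TBinaryProtocol
        byte[] blob = Files.readAllBytes(Paths.get(args[0]));

        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        TRecordList parsed = new TRecordList();
        new TDeserializer(new TBinaryProtocol.Factory()).deserialize(parsed, blob);

        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("serialized bytes:           " + blob.length);
        System.out.println("approx heap used by beans:  " + (after - before));
    }
}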

Thanks again


On Fri, Jun 11, 2010 at 11:32 AM, Dave Engberg <de...@evernote.com> wrote:

>
> Evernote uses Thrift for all client-server communications, including
> third-party API integrations (http://www.evernote.com/about/developer/api/).
>  We serialize messages up to 55MB via Thrift.  This is very efficient on the
> wire, but marshalling and unmarshalling objects can take a fair amount of
> RAM due to various temporary buffers built into the networking and IO
> runtime libraries.
>
>
>
> On 6/11/10 8:26 AM, Abhay M wrote:
>
>> Hi,
>>
>> Are there any known concerns with serializing large data sets with Thrift?
>> I
>> am looking to serialize messages with 10-150K records, sometimes resulting
>> in ~30 MB per message. These messages are serialized for storage.
>>
>> I have been experimenting with Google protobuf and saw this in the
>> documentation (
>> http://code.google.com/apis/protocolbuffers/docs/techniques.html) -
>> "Protocol Buffers are not designed to handle large messages. As a general
>> rule of thumb, if you are dealing in messages larger than a megabyte each,
>> it may be time to consider an alternate strategy."
>> FWIW, I did switch to the delimited write/parse API (Java only) as
>> recommended in the doc and it works well. But the Python protobuf
>> implementation lacks this API and is slow.
>>
>> Thanks
>> Abhay
>>
>>
>>
>

Re: Serializing large data sets

Posted by Dave Engberg <de...@evernote.com>.

On 6/11/10 10:27 AM, Bjørn Borud wrote:
>
> So I take it you put your thrift server behind Apache or similar and 
> then just proxy the requests to the actual thrift http servers (so you 
> can let Apache take care of the SSL bit and then use regular HTTP 
> internally?)

Yes, the HTTP Thrift connections go through firewalls into Citrix load 
balancers that also handle SSL.  The balancers forward the HTTP requests 
internally to one of the ~50 internal boxes based on patterns in the URI 
(e.g. /edam/note/s14 goes to shard #14), or round-robin otherwise.

> and as a happy (paying) Evernote customer who uses Evernote on all my 
> machines, Android and iPhone I can tell you that it works brilliantly :-)
Excellent, thanks!


Re: Serializing large data sets

Posted by Bjørn Borud <bb...@gmail.com>.
2010/6/11 Dave Engberg <de...@evernote.com>

>
> No, we only use HTTP Transport.  For anything on the public Internet, this
> is the only way to go ... it also gives you lots of extra advantages like
> client firewall support, hardware load balancing, SSL "for free", etc.  When
> we were adopting Thrift three years ago, I did some synthetic load tests to
> compare the overhead of THttpClient transport versus direct binary
> transport.  If the HTTP stack supports proper HTTP Keep-Alive, the overhead
> was negligible (under 20%).  Unfortunately, several languages don't do
> proper keep-alive in their HTTP libraries by default, so your mileage may
> vary drastically.
>

So I take it you put your thrift server behind Apache or similar and then
just proxy the requests to the actual thrift http servers (so you can let
Apache take care of the SSL bit and then use regular HTTP internally?)



> We mitigate against Thrift-related denial-of-service attacks through a mix of
> measures that should (hopefully) make a Thrift protocol attack less fruitful
> than other attacks.  (I.e. so that Thrift isn't the weakest link.)
> For example, we use maxSkipDepth() to avoid bogus sequences of nested
> structures:
>
> http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/java/src/org/apache/thrift/protocol/TProtocolUtil.java?revision=760189&view=co
> And we determine the total incoming message length via the HTTP
> Content-Length header to reject big messages before parsing, and use this as
> a limit to TBinaryProtocol.setReadLength() to automatically reject bogus
> object length/size fields:
>
> http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/java/src/org/apache/thrift/protocol/TBinaryProtocol.java?view=co


ok, thanks.  those are valuable tips.


>
> Our use of Thrift is obviously a bit unusual compared to most folks using
> it for internal server-server communications, but we have millions of
> distinct client machines talking Thrift to Evernote every month, so I can
> vouch that it works.
>
>
and as a happy (paying) Evernote customer who uses Evernote on all my
machines, Android and iPhone I can tell you that it works brilliantly :-)

-Bjørn

Re: Serializing large data sets

Posted by Dave Engberg <de...@evernote.com>.
No, we only use HTTP Transport.  For anything on the public Internet, 
this is the only way to go ... it also gives you lots of extra 
advantages like client firewall support, hardware load balancing, SSL 
"for free", etc.  When we were adopting Thrift three years ago, I did 
some synthetic load tests to compare the overhead of THttpClient 
transport versus direct binary transport.  If the HTTP stack supports 
proper HTTP Keep-Alive, the overhead was negligible (under 20%).  
Unfortunately, several languages don't do proper keep-alive in their 
HTTP libraries by default, so your mileage may vary drastically.
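
(For anyone wondering what the client side of the HTTP transport looks like,
here is a minimal Java sketch; MyService stands in for whatever generated
service client you have, and the URL is invented.)

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.THttpClient;

public class HttpTransportExample {
    public static void main(String[] args) throws Exception {
        // Each RPC becomes an HTTP POST; keep-alive in the underlying HTTP
        // stack is what keeps the per-call overhead low.
        THttpClient transport = new THttpClient("https://api.example.com/edam/note/s14");
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        MyService.Client client = new MyService.Client(protocol);
        // client.someMethod(...);  // calls go over HTTPS through the load balancer
        transport.close();
    }
}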

We mitigate against Thrift-related denial-of-service attacks through a mix of 
measures that should (hopefully) make a Thrift protocol attack less 
fruitful than other attacks.  (I.e. so that Thrift isn't the weakest link.)
For example, we use maxSkipDepth() to avoid bogus sequences of nested 
structures:
http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/java/src/org/apache/thrift/protocol/TProtocolUtil.java?revision=760189&view=co
And we determine the total incoming message length via the HTTP 
Content-Length header to reject big messages before parsing, and use 
this as a limit to TBinaryProtocol.setReadLength() to automatically 
reject bogus object length/size fields:
http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/java/src/org/apache/thrift/protocol/TBinaryProtocol.java?view=co
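
(A rough Java sketch of how those two knobs could be wired into an HTTP
handler; the 64 MB ceiling and the servlet plumbing are assumptions, and only
the setMaxSkipDepth()/setReadLength() calls come from the classes linked
above.)

import javax.servlet.http.HttpServletRequest;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocolUtil;
import org.apache.thrift.transport.TIOStreamTransport;

public class ThriftRequestGuards {
    private static final int MAX_MESSAGE_BYTES = 64 * 1024 * 1024;  // assumed ceiling

    static {
        // Cap how deep skip() will recurse when it hits bogus nested structures.
        TProtocolUtil.setMaxSkipDepth(64);
    }

    static TBinaryProtocol protocolFor(HttpServletRequest req) throws Exception {
        int declared = req.getContentLength();
        if (declared < 0 || declared > MAX_MESSAGE_BYTES) {
            throw new IllegalArgumentException("refusing message of length " + declared);
        }
        TBinaryProtocol proto =
                new TBinaryProtocol(new TIOStreamTransport(req.getInputStream()));
        // Any string/container length field claiming more bytes than the request
        // body actually contains will now fail fast instead of allocating a huge buffer.
        proto.setReadLength(declared);
        return proto;
    }
}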

Our use of Thrift is obviously a bit unusual compared to most folks 
using it for internal server-server communications, but we have millions 
of distinct client machines talking Thrift to Evernote every month, so I 
can vouch that it works.


On 6/11/10 9:37 AM, Bjørn Borud wrote:
> On Fri, Jun 11, 2010 at 5:32 PM, Dave Engberg <de...@evernote.com> wrote:
>
>    
>> Evernote uses Thrift for all client-server communications, including
>> third-party API integrations (http://www.evernote.com/about/developer/api/).
>>   We serialize messages up to 55MB via Thrift.  This is very efficient on the
>> wire, but marshalling and unmarshalling objects can take a fair amount of
>> RAM due to various temporary buffers built into the networking and IO
>> runtime libraries.
>>
>>      
> do you use TFramedTransport?  if so, I would assume that you have set the
> frame size to 55 MB to avoid the OOM error problems?  I've been thinking a bit
> about this lately since I may want to expose a Thrift API to the outside
> world.  Not setting a limit makes it exceptionally susceptible to
> denial-of-service (just connect a socket and say "asdf" and boom).  Setting
> the limit too high would require about 5 minutes more hacking to create a
> program that sucks up lots of resources on the server.
>
> (I guess this problem is also why TFramedTransport avoids using
> direct-allocated ByteBuffer?)
>
> One improvement would be to have the ability to do sanity checks on frames
> over a certain size -- so that connections writing bogus data can be killed
> off early.  But it isn't a quick fix and I am not entirely convinced that it
> is worthwhile either.
>
> -Bjørn
>
>    

Re: Serializing large data sets

Posted by Bjørn Borud <bb...@gmail.com>.
On Fri, Jun 11, 2010 at 5:32 PM, Dave Engberg <de...@evernote.com> wrote:

>
> Evernote uses Thrift for all client-server communications, including
> third-party API integrations (http://www.evernote.com/about/developer/api/).
>  We serialize messages up to 55MB via Thrift.  This is very efficient on the
> wire, but marshalling and unmarshalling objects can take a fair amount of
> RAM due to various temporary buffers built into the networking and IO
> runtime libraries.
>

do you use TFramedTransport?  if so, I would assume that you have set the
frame size to 55 MB to avoid the OOM error problems?  I've been thinking a bit
about this lately since I may want to expose a Thrift API to the outside
world.  Not setting a limit makes it exceptionally susceptible to
denial-of-service (just connect a socket and say "asdf" and boom).  Setting
the limit too high would require about 5 minutes more hacking to create a
program that sucks up lots of resources on the server.

(I guess this problem is also why TFramedTransport avoids using
direct-allocated ByteBuffer?)
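
(Not something from the thread above, just a sketch of how the Java bindings
let you cap the frame size if you do go the TFramedTransport route; the 64 MB
number and the host/port are arbitrary assumptions.)

import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class FramedWithLimit {
    // Tune to just above the largest legitimate message you expect.
    private static final int MAX_FRAME_BYTES = 64 * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        // Client side: length-prefix each message, but refuse any frame header
        // that claims more than MAX_FRAME_BYTES.
        TTransport socket = new TSocket("localhost", 9090);
        TTransport framed = new TFramedTransport(socket, MAX_FRAME_BYTES);
        framed.open();
        // ... build a protocol and generated client on top of `framed` ...
        framed.close();
    }

    // Server side: hand the same limit to the factory that wraps accepted connections.
    static TFramedTransport.Factory serverTransportFactory() {
        return new TFramedTransport.Factory(MAX_FRAME_BYTES);
    }
}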

One improvement would be to have the ability to do sanity checks on frames
over a certain size -- so that connections writing bogus data can be killed
off early.  But it isn't a quick fix and I am not entirely convinced that it
is worthwhile either.

-Bjørn

Re: Serializing large data sets

Posted by Dave Engberg <de...@evernote.com>.
Evernote uses Thrift for all client-server communications, including 
third-party API integrations 
(http://www.evernote.com/about/developer/api/).  We serialize messages 
up to 55MB via Thrift.  This is very efficient on the wire, but 
marshalling and unmarshalling objects can take a fair amount of RAM due 
to various temporary buffers built into the networking and IO runtime 
libraries.


On 6/11/10 8:26 AM, Abhay M wrote:
> Hi,
>
> Are there any known concerns with serializing large data sets with Thrift? I
> am looking to serialize messages with 10-150K records, sometimes resulting
> in ~30 MB per message. These messages are serialized for storage.
>
> I have been experimenting with Google protobuf and saw this in the
> documentation (
> http://code.google.com/apis/protocolbuffers/docs/techniques.html) -
> "Protocol Buffers are not designed to handle large messages. As a general
> rule of thumb, if you are dealing in messages larger than a megabyte each,
> it may be time to consider an alternate strategy."
> FWIW, I did switch to the delimited write/parse API (Java only) as
> recommended in the doc and it works well. But the Python protobuf
> implementation lacks this API and is slow.
>
> Thanks
> Abhay
>
>