Scaling Twitter: “Is Twitter UDP or TCP? It’s definitely UDP.”

Presented by Blaine Cook, a developer from Odeo, now probably CTO of Twitter (spawned by Obvious Corp, I think). There’s a video and slides (yes, you need evil Flash, so I haven’t viewed it myself). Then there are my notes… possibly with some thoughts attached to them. No, they’re not organized; I’m too busy and tired…

Rails scales, but not out of the box. Untuned, this would cause Twitter to stop working very quickly.

600 requests/second, 180 rails instances (mongrel), 1 DB server (MySQL) + 1 slave (read only slave, for statistics purposes), 30-odd processes for misc. jobs, 8 Sun X4100s.

Uncached requests complete in less than 200ms most of the time.

steps:
1. realize your site is slow
2. optimize the database
3. “Cache the hell out of everything”
4. scale messaging
5. deal with abuse
6. profit.

Have stats (something Twitter didn’t have before): munin, nagios, awstats/Google Analytics (the latter obviously doesn’t work if your site itself doesn’t load), exception notifier/exception logger (exception logger is what they use at Twitter, so you don’t get lots of email :P). You need reporting to track problems.

Benchmarks – they don’t do profiling; they just rely on their users! What torture for the poor users…

“The next application I build is going to be easily partitionable.” – Stewart Butterfield
Dealing with abusers…
Inverse spamming – The Italians – receiving SMS gives you free call credits!
9,000 friends in 24 hours doesn’t scale!
Just be ruthless and delete said users. This is where you thank the reporting tools, which let you detect abusers.

They’ve looked at Magic Multi Connections, it looks great, but it wouldn’t work for Twitter.

The main bottleneck is really in DRb and template generation. The template optimizer that Stefan Kaes wrote doesn’t work for them.

Twitter: built by 2 people at first. Even now, they’re just 3 developers.

When mongrels hit swap, they become useless. So turn swap off.

Twitter themselves don’t seem to want to give out details of how many users, etc. they have. Shifty, beyond the fact that they claim it’s “a lot of users”.

Twitter is not built for partitioning. Social applications should be designed to be easily partitionable. WordPress, and anything 37signals builds, tends to be partitionable. Things start getting hairy when you have 3,000+ friends!

Index everything – Rails won’t do it for you; you need an index on any column that appears in a WHERE clause.
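A hypothetical Rails migration sketching that advice – the table and column names here are my own examples, not from the talk:

```ruby
# Hypothetical migration (tables/columns assumed for illustration).
# The point: every column that shows up in a WHERE clause, and every
# pair of columns used together in lookups, gets an explicit index,
# because Rails will not add one for you.
class AddIndexesToStatuses < ActiveRecord::Migration
  def self.up
    add_index :statuses, :user_id       # WHERE user_id = ?
    add_index :statuses, :created_at    # WHERE created_at > ?
    add_index :friendships, [:user_id, :friend_id]  # friendship lookups
  end

  def self.down
    remove_index :statuses, :user_id
    remove_index :statuses, :created_at
    remove_index :friendships, [:user_id, :friend_id]
  end
end
```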

Denormalize a lot – heresy in the Rails book? He hopes not. This is single-handedly what saved Twitter.
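The denormalization idea can be sketched in plain Ruby (the class and attribute names are mine, not Twitter’s): instead of counting rows on every read, keep a redundant counter and maintain it at write time.

```ruby
# Denormalization sketch: @friends is the normalized truth,
# @friends_count is the redundant copy maintained on write.
# Reads become O(1) instead of a COUNT over millions of rows.
class User
  attr_reader :friends_count

  def initialize
    @friends = []        # the normalized data
    @friends_count = 0   # the denormalized counter
  end

  def add_friend(other)
    @friends << other
    @friends_count += 1  # pay the cost once, at write time
  end
end

u = User.new
3.times { |i| u.add_friend("user_#{i}") }
puts u.friends_count  # 3 -- no scan of the friends list needed
```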

They use InnoDB. Don’t do status.count() when there are millions of rows… it’ll stop working. MyISAM would be faster, but still, don’t.

SQL LIKE “%$#!$%” queries – i.e. search. Twitter has disabled search right now… this makes their database enjoy life.

Average DB time is 50ms (to at most 100ms)

They’re not hurting on the DB. The master DB machine is at a quarter CPU usage. So they don’t see the need to partition at this point.

Twitter does a lot of caching; they use memcached. If you really need status.count(), use memcached.
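A minimal read-through sketch of that advice – a plain Hash stands in for memcached, and the block stands in for the expensive COUNT query; the names are mine:

```ruby
# Read-through cache sketch: return the cached value if fresh,
# otherwise run the expensive block once and store the result.
CACHE = {}

def cached_count(key, ttl = 300)
  entry = CACHE[key]
  return entry[:value] if entry && Time.now < entry[:expires_at]
  value = yield                      # only hit the database on a miss
  CACHE[key] = { value: value, expires_at: Time.now + ttl }
  value
end

db_hits = 0
2.times do
  cached_count("status:count") do
    db_hits += 1                     # simulate the expensive COUNT(*)
    1_000_000
  end
end
puts db_hits  # 1 -- the second call was served from cache
```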

The query for your friends’ statuses on your Twitter homepage is a complicated query with a lot of JOINs. They use ActiveRecord; they store the statuses in memory and don’t touch the DB. They plan to use memcached for the statuses too in the future.

ActiveRecord objects are huge (which is why they’re not stuck in memcached yet). They’re looking at implementing an “ActiveRecord nano” or something similar – smaller, storing only the critical attributes in the cache, and using method_missing if you don’t find what you’re looking for.
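A rough sketch of what that could look like, based only on the description above – all class, attribute, and loader names here are hypothetical:

```ruby
# "ActiveRecord nano" sketch: cache only the critical attributes,
# and let method_missing fall back to a full (expensive) load for
# anything else.
class NanoStatus
  CRITICAL = [:id, :text, :user_id].freeze

  def initialize(attrs, loader)
    @attrs = attrs.select { |k, _| CRITICAL.include?(k) }
    @loader = loader  # callable that fetches the full record
  end

  def method_missing(name, *args)
    return @attrs[name] if @attrs.key?(name)   # cheap, cached path
    full = @loader.call                        # rare, expensive path
    full.fetch(name) { super }
  end

  def respond_to_missing?(name, include_private = false)
    @attrs.key?(name) || super
  end
end

loads = 0
status = NanoStatus.new(
  { id: 1, text: "hi", created_at: Time.now },
  lambda { loads += 1; { id: 1, text: "hi", source: "web" } }
)
puts status.text    # "hi", served from the small cached object
puts status.source  # "web", triggered one full load
puts loads          # 1
```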

90% of Twitter’s requests are API requests, so cache them. No fragment or page caching on the front-end, but for API requests, lots of caching.
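A minimal sketch of what caching API responses could look like – the actual mechanism wasn’t described, so the path and names here are my own illustration:

```ruby
# Cache the rendered response body keyed by request path; repeat
# requests skip the expensive render entirely.
API_CACHE = {}
RENDERS = []   # records each expensive render, for demonstration

api_response = lambda do |path|
  API_CACHE[path] ||= begin
    RENDERS << path                  # pretend this is template generation
    "<statuses for #{path}>"
  end
end

2.times { api_response.call("/statuses/friends_timeline.xml") }
puts RENDERS.length  # 1 -- the second request was served from cache
```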

Producer(s) -> Message Queue -> Consumer(s)

DRb: zero redundancy, tightly coupled.

They use ejabberd as their Jabber server.

When the Jabber client went down, everything went down. So they moved to using Rinda. It’s O(N) for take(), so if the queue has 70,000 messages, you just shut it down, restart it, and lose those 70,000 messages. Sigh.

“Someone asked if Twitter is UDP or TCP? It’s definitely UDP.” — Blaine Cook

LiveJournal has horizontally scaled MySQL – just MySQL plus lightweight locking. RabbitMQ (Erlang) is something they’re quite clearly looking at, but it looks ugly, and they don’t necessarily want to implement it.

Starling was written. It’s Ruby, and will be ported to something faster. It does 4,000 transactional messages/second, will have multiple queues (like a cache-invalidation one), speaks memcached (set, get), and writes it all to disk. The first pass was written in 4 hours, and it’s been working fine for the last few days (i.e. since Wednesday). Twitter died on Tuesday at the Web 2.0 conference! Starling will probably be open source.

Use messages to invalidate your cache.
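A sketch of message-driven cache invalidation: a writer pushes invalidation messages onto a queue, and a consumer pops them and evicts the stale cache keys. Here Ruby’s thread-safe Queue and a plain Hash stand in for Starling and memcached; the key names are mine.

```ruby
# Message-driven cache invalidation sketch.
cache = { "user:1:timeline" => "stale data", "user:2:timeline" => "ok" }
queue = Queue.new                     # stand-in for the message queue

# Producer: a write to user 1's timeline enqueues an invalidation message.
queue << "user:1:timeline"

# Consumer: drain the queue, evicting each named cache key.
consumer = Thread.new do
  cache.delete(queue.pop) until queue.empty?
end
consumer.join

puts cache.keys.inspect  # ["user:2:timeline"] -- stale entry evicted
```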


4 Comments

  1. wahlau says:

    how come you have some repeated contents?

    thanks for the post. i learnt quite some about scalability issues…

  2. James says:

    The video is on google video, so you can download an AVI to watch it: http://video.google.com.au/videoplay?docid=-7846959339830379167

  3. Blaine says:

    Great notes, thanks for writing these up!

    Let me know if you have any other questions, or if anything was unclear from the talk.

  4. byte says:

    @wahlau: repeated? How so? Like I said, I didn’t edit it…

    @james: Thanks! Running x86_64 linux means no official Flash plugin, which means grief as swfdec and gnash get up to scratch.

    @Blaine: It was my pleasure. I’ve been in touch with Jack Dorsey as well, in case we at mysql can help!

