
MySQL at Google

MySQL: The Real Grid Database, Mark Callaghan, Chip Turner

A tremendous amount of work was also done by Wei Li and Gene Pang.

Google has a large MySQL deployment, and they enhance it as needed.

MySQL@Google: lots of queries, lots of transactions, lots of data, and rapid growth. A real workload, with both OLTP and reporting. The workload at Google is *critical*.

The well known solution is to deploy a “grid database”:

  • use many replicas to scale read performance
  • shard your data over many masters to scale write performance (horizontal partitioning of data)
  • sharding is easy, resharding is hard
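That last bullet can be made concrete with a toy shard router. This is a hypothetical sketch (a naive modulo scheme, not Google's actual implementation) showing why sharding is easy but resharding is hard:

```python
# Hypothetical sketch of hash-based sharding: route each user's rows to
# a master by hashing the user id. Names and numbers are illustrative.

NUM_SHARDS = 4

def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Pick a master for this user's writes."""
    return user_id % num_shards

# Sharding is easy: every user deterministically lands on one master.
assert shard_for(42) == 2

# Resharding is hard: adding one shard moves most users to a different
# master, so their data has to be physically migrated.
moved = sum(1 for uid in range(1000)
            if shard_for(uid, 4) != shard_for(uid, 5))
print(f"{moved} of 1000 users change shards when going 4 -> 5 shards")
```

With consistent hashing instead of a plain modulo, far fewer users would move when a shard is added; that is one common answer to the resharding problem.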

A large number of small servers means not much capacity is lost when one server fails; the goal is to support as many servers as possible with a few DBAs.

Manageability is important at Google – make all tasks scriptable. That gives you time to solve more interesting problems, and also lets you support hundreds of servers with one DBA.

Google prefers under-utilizing servers – better to have 3 servers at 50% than 2 servers at 75%. Less maintenance, less tuning, and load spikes are tolerated better.

Monitor everything you can: vmstat, iostat, MySQL error logs, /var/log/messages, SHOW STATUS and SHOW PROCESSLIST output, etc. Archive it for as long as you can, and automate all of this as much as possible. Make it possible to query and visualize data from the archive. There are tools out there, and Google has some internal ones that they use.
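A minimal sketch of that archive-and-query idea: take periodic snapshots of `SHOW GLOBAL STATUS` and compute counter deltas between them (the thing you would actually graph). The snapshots here are assumed to be tab-separated, as the `mysql -B` client emits, and the sample values are invented:

```python
# Sketch: parse two SHOW GLOBAL STATUS snapshots and diff the counters.
# Sample values are made up for illustration.

def parse_status(text: str) -> dict:
    """Parse 'Variable_name<TAB>Value' lines into a dict."""
    status = {}
    for line in text.strip().splitlines():
        name, value = line.split("\t")
        status[name] = int(value) if value.isdigit() else value
    return status

def deltas(old: dict, new: dict) -> dict:
    """Counter growth between two snapshots -- what you'd archive/graph."""
    return {k: new[k] - old[k]
            for k in new
            if isinstance(new[k], int) and isinstance(old.get(k), int)}

snap1 = parse_status("Questions\t1000\nSlow_queries\t3\nUptime\t600")
snap2 = parse_status("Questions\t1900\nSlow_queries\t5\nUptime\t660")
print(deltas(snap1, snap2))
```

Divide the `Questions` delta by the `Uptime` delta and you have queries per second over the interval, straight from the archive.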

They tend not to store the logs in a database – it’s more efficient to bzip them or something (I’m thinking the ARCHIVE storage engine might be appropriate for them, possibly).

Many fast queries can be as much of a problem as one slow query.

mypgrep is an open source tool that Google has released.

Changed MySQL to count activity per account, table, and index:

  • SHOW USER_STATISTICS – Displays per account activity
  • SHOW TABLE_STATISTICS – each table, number of rows fetched/changed
  • SHOW INDEX_STATISTICS – rows fetched per index, find indexes that were never used
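That last point is the payoff: cross-reference the per-index fetch counts with the full index list, and the never-used indexes fall out directly. A hedged sketch with invented table and index names (not the patch's actual output format):

```python
# Sketch: find indexes that were never used, given (table, index) fetch
# counts like SHOW INDEX_STATISTICS reports. All data here is invented.

all_indexes = {("users", "PRIMARY"), ("users", "idx_email"),
               ("photos", "PRIMARY"), ("photos", "idx_taken_date")}

# (table, index) -> rows fetched via that index
index_stats = {("users", "PRIMARY"): 1_500_000,
               ("photos", "PRIMARY"): 9_000_000,
               ("users", "idx_email"): 42_000}

# Any index with zero fetches is a candidate for DROP INDEX.
unused = sorted(idx for idx in all_indexes
                if index_stats.get(idx, 0) == 0)
print(unused)
```

Dropping such indexes saves write overhead and disk, which matters a lot at the scale described above.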

MySQL high availability has an even brighter future – with DRBD, for instance. Cluster and Replication rock; however, they need features right now. They are committed to InnoDB and MySQL Replication.

Zero transaction loss on master failures. Minimal downtime on master failures (if they can keep downtime within 1-2 minutes, they’ll be happy). Reasonable cost in performance (added latency for a single workload) and in dollars.

Readers and writers don’t lock each other.

Failure happens everywhere: the OS (kernel OOM or panic), still running a lot of 32-bit servers, mysqld crashes due to code they themselves wrote, corrupted writes (so InnoDB checksums rock), file systems becoming inconsistent after an unplanned reboot (they use ext2), bad RAM, and people (rebooting by mistake! :) ).

Features they want to see in MySQL: synchronous replication as an option, a product that watches a master and initiates a failover, archives of the master’s binlog stored elsewhere (in case the master becomes unreachable), and state stored in the filesystem that stays consistent after a crash (a more modern FS than ext2 would make this better).

And everyone’s been pretty much talking about how Google is contributing to MySQL, with some interesting twists as to why they’re contributing back and open sourcing their code: google-mysql-tools.


Extreme Makeover: Database or MySQL@YouTube

Arguably one of the most interesting keynotes (and technical to boot!) – Paul Tuckfield not only entertained us during his 40-minute keynote, he also did so outside after the keynotes ended.

He was the only DBA at PayPal, and is now one of just 3 DBAs at YouTube who make it all happen. He’s only been a MySQLer for ~8 months (Oracle for ~15 years) – so I guess PayPal is an Oracle shop.

MySQL is one (important) piece of the scalability picture.

Technologies: Python, memcache, MySQL replication. He praises Python a lot (it’s much quicker than C++ for implementing goodness).

Click tracking goes to a separate MyISAM setup, but read/write traffic is on InnoDB, using replication. There are far more reads than writes at YouTube.

4 x 2GHz Opteron cores, 16GB RAM, 12 x 10kRPM SCSI disks – servers were constantly crashing, and replication saved them.

5.0 “mystery cache hits” – if you export and import (mysqldump and load back into 5.0), you boost your performance compared to upgrading in place, because the reload rebuilds tables in 5.0’s compact row format. They moved from 4.1 to 5.0.

Cache is king. Writes should be cached by the RAID controller rather than the OS. Only the DB should cache reads (not the RAID controller, not the Linux buffer cache).

Software striping atop hardware array.

The “oracle” caching algorithm – from academia (an idealized replacement policy that assumes knowledge of future accesses). Not something I’ve heard much about, and I definitely need to look into it further.

The talk ran long, but it would make a most interesting read – and an actual presentation, rather than a keynote. I hope his presentation makes it online sometime soon.

Note to entrepreneurs: if you’re building a web business and want to be acquired by Google, it’s quite possible that their due diligence includes “Python” compatibility – most of their released tools are Python-related or Python-based. Oh, and make sure you use commodity hardware (in fact, do that if you want to get VC funding, even).

Update: A little note on the oracle algorithm. If anyone has papers, and more credible links, please do drop me a line.


Lightning talks with Community Contributors

I think this was a really interesting session (because of all the contributors talking); my only minor complaint was that it was up against some really good talks, so we didn’t get more people showing up to a session that was very largely about the great Architecture of Participation. It also goes to show that blogging can earn you good rewards – almost everyone listed below is a somewhat active blogger.

Martin Friebe – bug reports, patches
Why? It’s just cool to contribute, and it improves your knowledge. MySQL rewards you (being named on the website, Enterprise, etc.).
How? Write code. Look for limitations. Just use MySQL.

Peter Zaitsev
Hates submitting bugs, but he needs a bug-free MySQL for himself and his customers. Therefore: report them, and scream loudly!
Be an early adopter.
Regular hardware, for storage engine benchmarks. Patches, and other cool bits for MySQL.

Sheeri Kritzer – blogger, user group meetings, podcasts
Bugs, but contributing is not only technical. “Just do it” (in terms of user groups)
You set your own deadlines, and you look like a hero when you’re a community member, as opposed to it being your job.
Don’t overcommit: back out earlier, rather than later
“chronic volunteer”

Paul McCullagh – PBXT storage engine
While testing PBXT he found a few bugs, and that’s how he became a Quality Contributor – he didn’t get that status by writing PBXT itself. I do think that’s wrong; maybe MySQL needs to drop the “Quality” and just have a Contributor program?

Baron Schwartz – innotop, blogger
MySQL is not perfect, and he misses a lot of Microsoft SQL Server’s tools. His motto is “don’t complain, do something about it.” And the opportunity is obvious.
innotop started as an InnoDB transaction monitor, sort of like mytop for InnoDB.
Next, MySQL Toolkit.

Beat Vontobel – blogger
User since 3.23, active since 5.0-alpha – lots of new features to blog about and a lot of bugs to post about. Surprised that most of his bugs got fixed very quickly. Blogging as a means of sharing knowledge.
He advises being a customer: bug reports are free, but if your report hits the internal bug database, you’re set for it getting fixed quicker.

Yoshiaki Tajika – NEC Japan, MySQL Customer Support
3 years ago, NEC began to support MySQL.
He likes the bug reporting system compared to Microsoft SQL Server’s – bug reports can be posted anytime, without any cost, and you can talk with the developers directly.

Mike Kruckenberg
Find things that are interesting, write about it, report it, change it.

Jeremy Cole – bug reports, patches, blogging
SHOW PROFILE in 5.0.37! DorsalSource. Builds of MySQL with patches and other interesting stuff. Go to website, upload patch, and you get builds on many different architectures.

Bill Karwin – prolific forum poster!
SHA2() patch – there was a comment that the US federal government wanted SHA-2 support in all government applications. Then he felt bad, so he wrote the feature! It passes the tests, and that’s how a feature got added.

Works at Zend Technologies, doing Zend Framework supporting MySQL. Writes articles on Forge.

Ask Bjorn Hansen
Used MySQL since around ’96-’97. Started with mSQL first! Everything that’s paid his bills for ~10 years has had some relation to MySQL!
Read the *excellent* documentation (a few times)!
A “file a bug a week” goal – this is way too easy. Install a new Linux distribution, read the documentation, and file away!
3 underrated MySQL features: Standards!
1. Timezone support (save all your date/datetime columns in UTC)
2. Unicode support (he wants applications that render his name the right way!). Tmp table memory requirements go up, but it’s OK…
3. Use strict mode! (STRICT_TRANS_TABLES and if brave use STRICT_ALL_TABLES)
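Tip 1 above – keep all date/datetime columns in UTC – can be sketched in a few lines. The function name is my own, not from the talk:

```python
# Sketch of the "store everything in UTC" convention: normalize an
# aware local datetime to naive UTC before it goes into a DATETIME
# column, and convert back to the user's zone only on display.
from datetime import datetime, timezone, timedelta

def to_utc_for_storage(local_dt: datetime) -> datetime:
    """Normalize an aware datetime to naive UTC for a DATETIME column."""
    return local_dt.astimezone(timezone.utc).replace(tzinfo=None)

# e.g. a user in UTC+8 saves 2007-04-25 10:00 local time
local = datetime(2007, 4, 25, 10, 0,
                 tzinfo=timezone(timedelta(hours=8)))
print(to_utc_for_storage(local))  # 2007-04-25 02:00:00
```

Doing the conversion at the application boundary means the database never has to care about timezones or DST rules at all.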


A Storage Engine for Amazon S3

A Storage Engine for Amazon S3, Mark Atwood

It looks mighty interesting, as transfers to Amazon S3 are free. I think it’ll work well in America and places where bandwidth rocks, but I don’t see this working too well in Australia. Oh, how I wish the Internets would improve.

Mark has got all his stuff online at A MySQL Storage Engine for AWS S3. He was also kind enough to upload most of the notes, which made my reporting easier, and don’t forget to view the presentation.

Traditional storage engines use the local disk.

Networked engines: Federated, ODBC, HTTP, MemCacheD and S3 storage engine.

What is S3?
Objects can be anywhere from 1 byte to 5GB. Amazon has several petabytes of storage for you to use :-)

Owning your own disks kind of sucks – you pay for storage even before you use it.

“An empty disk costs the same as a full one” – you pay a lot of money to put disks in data centers. And RAID isn’t “excellent” – then what about disaster recovery?

You can’t move an existing database over, and the S3 storage engine isn’t ready for a full schema yet. There are hacks that allow this, but maybe it will be available next year.

Over a billion items in a bucket, and they all come back in good time.

A bucket is fully virtually hosted, you get a SQL CMS in the MySQL server. Save your EC2 work.

S3 is very distributed (geographically) and asynchronous. Writes are replicated, so your data may be reordered (and delayed). So there are no temporal guarantees.

Use the WHERE clause – otherwise it will do a full table scan, and you’ll be paying Amazon lots of money :-)
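A back-of-the-envelope sketch of why that WHERE clause matters: every row fetched is an S3 request you pay for, so a full table scan multiplies the bill by the table size. The per-request price below is a placeholder, not Amazon's actual rate:

```python
# Sketch: rough cost of fetching rows from S3-backed storage, where
# each row fetched costs one GET request. Pricing is illustrative only.

GET_COST = 0.000001   # assumed $ per GET request (placeholder, not AWS's)

def scan_cost(rows_fetched: int) -> float:
    """Dollar cost of a query that fetches this many rows from S3."""
    return rows_fetched * GET_COST

full_scan = scan_cost(10_000_000)   # no WHERE clause: touch every row
point_lookup = scan_cost(1)         # keyed lookup: touch one row
print(f"full scan ~${full_scan:.2f} vs point lookup ~${point_lookup:.6f}")
```

The same query pattern that is merely slow on local disk becomes a recurring line item on the Amazon bill – hence the advice above.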

The talk ended with about 20 minutes to spare, and I certainly hope he hacks on it more over the next year. He’s also soliciting feedback, so try it out if you can. And now, to run to the remainder of the talk on Highly Available MySQL Cluster on Amazon EC2! Two Amazon talks, with emerging-technology goodness, at the same time? Pfft.


Federation at Flickr: A tour of the Flickr Architecture

I’ve always been a big Flickr fan and user, and I love them even more now that their 2GB limit is gone. So this was a most interesting talk, and I think photographers and Flickr users alike will find it interesting to see what’s behind Flickr. Dathan also has a very interesting blog.

Federation at Flickr: Doing Billions of Queries Per Day, Dathan Pattishall

A database guy – queries, performance, troubleshooting, application building, etc. He previously worked at Friendster, back in 2003.

Master-slave lag meant they were unable to keep up with the demand Flickr was seeing. So while they had multiple masters, they also had multiple single points of failure (SPOFs).

Everything had to be real time. The workload is write intensive, so more than one master was needed. No more slave lag. Serve pages fast, with many queries. Get rid of the SPOFs and be redundant.

Need ability to make live maintenance, repair data, and so forth, without taking the site down.

Federation Key Components:

  • Shards: my data gets stored on my shard, but when I perform an action on your content (say, leave a comment), the record of that action lives on your shard
  • Global Ring: it’s like DNS – you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time
  • PHP logic to connect to the shards and keep the data consistent (10 lines of code, with comments!)

Shards:

  • Slice of the main database
  • Active master-master ring replication: there are a few drawbacks in MySQL 4.1, such as honoring commits in master-master; auto-increment IDs are offset to keep it active-active
  • Shard assignments are from a random number for new accounts
  • Migration is done from time to time, so you can move certain power users off a shard. Shards need balancing if users have a lot of photos… 192,000 photos and 700,000 tags take about 3-4 minutes to migrate. Migration is done manually.

Global Ring:

  • It’s a lookup ring for stuff that can’t be federated

Owner_id -> SHARD_ID
Photo_id -> Owner_id
Group_id -> SHARD_ID

It makes sense to hash by user, but then there’s the notion of groups: a group can be owned by many users, and many users contribute to a group.

This is cached in memcache for about half an hour.

Ticket servers – MySQL 5, low-grade machines, circular replication. This is the ID generator, handing out globally unique IDs. Several unique IDs are associated with each row.
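One way a pair of circularly-replicated ticket servers can hand out non-colliding IDs without coordinating is the same idea as MySQL's auto_increment_increment and auto_increment_offset settings: each server takes a different offset and both step by the server count. Whether Flickr configured it exactly this way is an assumption; this is just a sketch of the technique:

```python
# Sketch: two ticket servers sharing one global ID space. Server A
# takes the odd IDs, server B the even IDs, so they never collide
# even though they don't talk to each other.

def ticket_server(offset: int, increment: int):
    """Yield this server's share of the global ID space."""
    ticket = offset
    while True:
        yield ticket
        ticket += increment

server_a = ticket_server(offset=1, increment=2)  # odd IDs
server_b = ticket_server(offset=2, increment=2)  # even IDs

ids = [next(server_a) for _ in range(3)] + [next(server_b) for _ in range(3)]
print(ids)  # [1, 3, 5, 2, 4, 6] -- disjoint streams, no coordination
```

Losing one server halves the issue rate but never produces a duplicate, which is exactly what you want from a redundant ID generator.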

Clicking a Favourite:

  • Pull the photo owner’s account from cache to get their shard location (say, shard-5)
  • Pull my information from cache to get my shard location (say, shard-13)
  • Start a “distributed transaction” – to answer the questions: who favourited the photo? What are my favourites?

They can ask a question of any shard and recover the data. It’s absolutely redundant.

To get rid of replication lag…

  • every page load, the user is assigned to a bucket
  • if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don’t use persistent connections – they build connections and tear them down, so every page load tests the connections.
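The failover logic in those bullets might look something like this sketch, where `connect()` is a stand-in for a real MySQL connection attempt and the host names are invented:

```python
# Sketch of the per-page-load connection logic: walk the host list,
# skip dead hosts, and fail only when every host is down.

def connect(host: str, down_hosts: set):
    """Stand-in for a real connection attempt to a shard host."""
    if host in down_hosts:
        raise ConnectionError(f"{host} is down")
    return f"connection to {host}"

def get_connection(hosts: list, down_hosts: set):
    """Fresh (non-persistent) connection; every page load retests hosts."""
    for host in hosts:
        try:
            return connect(host, down_hosts)
        except ConnectionError:
            continue               # dead host: try the next in the list
    return None                    # all hosts down -> render error page

shard_hosts = ["db-a", "db-b"]
print(get_connection(shard_hosts, down_hosts={"db-a"}))
print(get_connection(shard_hosts, down_hosts={"db-a", "db-b"}))
```

Because connections are built and torn down on every page load, a recovered host comes back into rotation automatically – no connection pool to flush.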

Every user’s reads and writes are kept on one shard, so the notion of replication lag is gone.

Each server in a shard runs at 50% load, so they can shut down half the servers in each shard – one server can take the full load if the other server in the shard is down or in maintenance mode.

During traffic spikes they break the 50% rule, though. They do something like 6,000-7,000 queries per second, while the design target is at most 4,000 queries per second per server to keep it at 50% load.

An average page runs 27-35 SQL statements. Favourites counts are real time. API access to the database is all real time. They achieved the real-time requirements without any disadvantages.

Over 36,000 queries per second overall – running within the capacity threshold. Bursts of traffic can double that 36k qps. More than 4 billion queries per day.

Each shard holds 400K+ users’ data.

“So you’re a single point of failure” — a member of the audience. He’s the main/only DBA at the moment.

Search:

  • Two search back-ends: the shards (35k qps across a few shards) and Yahoo!’s (proprietary) web search
  • An owner’s single-tag search, or a batch tag change (say, via Organizr), goes to the shards due to real-time requirements; everything else goes to Yahoo!’s engine (probably about 90%, which lags behind the real-time goodness)
  • Think of it as having Lucene-like search

Hardware: EM64T w/RHEL4, 16GB RAM, 6-disk 15K RPM RAID-10, in 2U boxes. Data size is 12TB of user metadata (these are not the photos – this is just InnoDB ibdata files; the photos are a lot larger). Each shard has ~120GB of data.

“Able to serve the traffic: Priceless”

Things MySQL could do better: multi-master replication, a threading bug in InnoDB on 4.1 (already fixed in 5.0), optimization of OR queries, bitwise operations, sets.

MySQL 5.0 is used on shards that hold auxiliary data, like logging, and also for generating IDs on the ticket servers (no replication between them; a single row per server). IDs are 64-bit internally, but a 32-bit ID is exposed to the public to keep URLs short.

Backup procedure: ibbackup on a cron job that runs across the various shards at different times, hot-backing up to a spare. Snapshots are taken every night across the entire cluster of databases.

Aggregating the data is very fast, because it’s one process per shard. Stick it into a table, or recover data from another copy on other users’ shards.

max_connections = 400 per shard, or 800 per server & shard – plenty of capacity and connections. The thread cache is set to 45, because they don’t see more than 45 users with simultaneous activity.

Schema updates? Shut down half the servers in a shard, force traffic to the other half, push the update, repeat.

32 concurrent InnoDB threads on each shard. Roughly 5% of the 12TB is held in buffer pools across the cluster – about 13GB per server.

Linux allocates memory per thread, so you can get an OS ERROR saying you’ve run out of memory.

Photos are stored on filers. Upon upload, the photo is processed into the different sizes, and then it’s complete. The metadata, and pointers to the filers, are stored in the database.


Monty is the world’s first MySQL fellow

Yes, you heard it here first: Monty is the world’s first MySQL Fellow – for outstanding work, for being co-founder of MySQL AB, and for continuing excellent work. Standing ovation, might I add.

Some quick comments from the clash of the DB egos:
Heikki Tuuri – he worked for Solid, then went back to university and learned everything about databases. He attributes it to stubbornness. Monty said he should make InnoDB open source, and he obeyed!

Mikael Ronstrom – “entrepreneurial ego in a large company”. He had a vision of what he wanted to do. Guess he was also a bit stubborn…

Jim Starkey had the most interesting hat ;-) And he probably made a couple of bucks from it…
“Does the United States still have anything to say in databases?”

Starting over is important, because you learn new things… Jim Starkey has written about 5 databases in his lifetime.

Nitro: really fast speeds (faster than MyISAM), though Monty does say it’s not a perfect solution. Positional awareness: any index entry knows where it is.

Update: Oops, looks like Kaj might’ve got this out first – Monty: The First MySQL Fellow.


