<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>StyleFeeder Tech Blog &#187; Database</title>
	<atom:link href="http://blog.tech.stylefeeder.com/category/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tech.stylefeeder.com</link>
	<description>Bitheads Invade the Fashion World</description>
	<lastBuildDate>Mon, 02 Nov 2009 17:01:10 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Moving to another cloud</title>
		<link>http://blog.tech.stylefeeder.com/2009/08/25/contegix-cloud-computing/</link>
		<comments>http://blog.tech.stylefeeder.com/2009/08/25/contegix-cloud-computing/#comments</comments>
		<pubDate>Tue, 25 Aug 2009 21:48:20 +0000</pubDate>
		<dc:creator>Philip Jacob</dc:creator>
				<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://blog.tech.stylefeeder.com/?p=225</guid>
		<description><![CDATA[We are in the process of migrating one of our backend data-processing servers from a legacy hosting company in NYC to Contegix.  What&#8217;s unusual about this transition is that we&#8217;re moving the machine onto Contegix&#8217;s new cloud platform rather than to a traditional server.  We&#8217;ve noticed a few things already.  When we were copying over a [...]]]></description>
			<content:encoded><![CDATA[<p>We are in the process of migrating one of our backend data-processing servers from a legacy hosting company in NYC to Contegix.  What&#8217;s unusual about this transition is that we&#8217;re moving the machine onto Contegix&#8217;s new cloud platform rather than to a traditional server.  We&#8217;ve noticed a few things already.  When we were copying over a huge backup of our databases, we noticed that they were transferring across the network from NYC to St Louis at 93Mbps, which is <em>not frigging bad</em>!  As I write this, we&#8217;re loading over 100GB of data into a MySQL server on our new Contegix cloud machine at ~30K blocks/second (as measured by vmstat), which means that this thing has lightning fast I/O&#8230; not surprising since the storage is on an EqualLogic SAN (<strong>Update</strong>: we later saw this increase to ~70K blocks/second).</p>
<p>The differences between this cloud platform and EC2 (which we still use for some other needs) are striking.  The application that we will host on this new VM sometimes needs a lot of memory.  With Contegix, we can grow that all the way up to 128GB with 32 cores.  Amazon doesn&#8217;t even come close to that &#8211; their max is 15GB.  Beyond that, you have to figure out how to distribute your application over a bunch of hosts.  But sometimes you just need 20GB of memory and all the problems go away.  Plus we don&#8217;t have to compete for these resources &#8211; they&#8217;re guaranteed to us.</p>
<p>I also like the fact that the machine doesn&#8217;t disappear into oblivion when it reboots, which is a feature (?) of EC2 instances.  We can also grow our storage on this platform past the point that I care to think about.  Plus, we get all the Contegix support that we want if we choose to do crazy things with this host.</p>
<p>The virtualization technology is VMware ESX, which is darn cool stuff (having just set it up on an integration server here a week or so ago, I have to say that I like what I have seen so far).  We&#8217;ve already seen our VM get hot-migrated to another physical box in order to maximize the resources available to us.  Things got slow for a little bit, but then they got lightning fast.  I think we were copying data into the machine at that point and saw no impact to open connections, etc.  Don&#8217;t ask me why, but I&#8217;m still surprised that this works reliably.</p>
<p>So far so good.  We&#8217;ll report back with more later.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tech.stylefeeder.com/2009/08/25/contegix-cloud-computing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Memcached vs MySQL</title>
		<link>http://blog.tech.stylefeeder.com/2008/08/22/memcached-vs-mysql/</link>
		<comments>http://blog.tech.stylefeeder.com/2008/08/22/memcached-vs-mysql/#comments</comments>
		<pubDate>Fri, 22 Aug 2008 20:48:21 +0000</pubDate>
		<dc:creator>Philip Jacob</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[caching]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://blog.tech.stylefeeder.com/?p=41</guid>
		<description><![CDATA[I recently had lunch with Dan Weinreb, whom I met at the Xconomy cloud computing event back in June.  We talked about many topics, mostly scalable database architectures, but also about caching.  He mentioned that he was doing some stuff with memcached lately, which I found very interesting.  Now, memcached certainly has some nice features, [...]]]></description>
			<content:encoded><![CDATA[<p>I recently had lunch with <a href="http://danweinreb.org/blog/">Dan Weinreb</a>, whom I met at the Xconomy <a href="http://blog.tech.stylefeeder.com/2008/06/24/cloud-computing/">cloud computing</a> event back in June.  We talked about many topics, mostly scalable database architectures, but also about caching.  He mentioned that he was doing some stuff with <a href="http://www.danga.com/memcached/">memcached</a> lately, which I found very interesting.  Now, memcached certainly has some nice features, but I mentioned to him that I found its performance to be surprisingly lackluster.  But people still rave about it and use it in really big installations (e.g. Facebook).  Yes, we do use memcached in production at <a href="http://www.stylefeeder.com/">StyleFeeder</a>, but it&#8217;s not in widespread use.  Instead, we rely on sharding our data across 100 MySQL databases.  This works really well for a number of reasons, not least of which is the fact that we cannot fit all of our data in memory cost effectively.  We also have stringent performance requirements for our site, which means that we need to have very simple data access paths.  Most pages on our site can be loaded with one single database query.</p>
<p>Dan mentioned that someone he knows did some basic benchmarks that clocked in around 700 requests per second.  I wanted to see what our numbers were like.</p>
<p>(Before I share these numbers, I want to emphasize that I&#8217;m not ready to hang my hat on these numbers yet, but I figured I&#8217;d share them for comments.)</p>
<p>100,000 get requests executed serially:</p>
<p>Memcached: Requests per second: 684<br />
MySQL: Requests per second: 884</p>
<p>Surprising, eh?  This is for the same data coming out of one of our shards and the same data coming out of memcached.</p>
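<p>For readers who want to try a comparable measurement, here is a minimal sketch of a serial timing loop.  It is not the harness used for the numbers above: the class name SerialBenchmark and the callback-style lookup are assumptions made for illustration, and the callback would need to wrap a real memcached or MySQL get.</p>

```java
import java.util.function.IntConsumer;

// Hypothetical helper: times a fixed number of lookups issued one after another.
final class SerialBenchmark {

    // Runs `requests` calls to `get` serially and returns the observed requests per second.
    static double requestsPerSecond(int requests, IntConsumer get) {
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            get.accept(i); // the caller decides what "get number i" means
        }
        // Guard against a zero interval on very fast runs.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return requests * 1_000_000_000.0 / elapsedNanos;
    }
}
```

<p>Calling SerialBenchmark.requestsPerSecond(100000, i -&gt; client.get(keys[i])) with a hypothetical client and key array would mirror the serial test; a concurrent variant would need a thread pool and is not shown here.</p>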
<p>I have more unanswered questions: instead of doing this serially, what happens when I have 20 concurrent threads pulling data out?  Does the memcached client library make a big difference?</p>
<p>I also wonder in what cases it makes sense to use memcached.  If you&#8217;re like us and have more data than you can reasonably hold in memory, you probably can&#8217;t use memcached unless you&#8217;re able to hit your main data store without a big penalty.  If you have an amount of data that can fit in memory, you should use something like Whirlycache (only relevant if you&#8217;re using a JVM), which did 2,500 requests per <strong>millisecond</strong> for the same test.</p>
<p>If you simply need to share data across a wide range of nodes, does memcached even make sense at that point?  Perhaps in the case of a more dynamic architecture, memcached and <a href="http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients">libketama</a> are pretty key.  Rigging that machinery manually with a MySQL backend is possible, but not the kind of thing you&#8217;d want to focus on unless you&#8217;re doing systems work.</p>
<p>I&#8217;m curious to hear what people think, because there&#8217;s certainly a lot of conventional wisdom behind memcached that I can&#8217;t understand right now.  Francois seems to be <a href="http://fschiettecatte.wordpress.com/2008/08/07/memcached-again/">in the same camp</a> as me.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tech.stylefeeder.com/2008/08/22/memcached-vs-mysql/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>MySQL Analysis Tools</title>
		<link>http://blog.tech.stylefeeder.com/2008/07/15/mysql-analysis-tools/</link>
		<comments>http://blog.tech.stylefeeder.com/2008/07/15/mysql-analysis-tools/#comments</comments>
		<pubDate>Tue, 15 Jul 2008 19:59:13 +0000</pubDate>
		<dc:creator>kilby</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[mysql innodb innotop mysqlreport performance]]></category>

		<guid isPermaLink="false">http://blog.tech.stylefeeder.com/?p=25</guid>
		<description><![CDATA[I&#8217;ve been having some odd performance issues with some of my MySQL queries since moving over to InnoDB.  I picked up the new O&#8217;Reilly title High Performance MySQL to try to track down the problem.  The book in turn recommends a couple of pretty cool monitoring/reporting tools that summarize a lot of the MySQL variable [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been having some odd performance issues with some of my <a href="http://www.mysql.com">MySQL</a> queries since moving over to InnoDB.  I picked up the new O&#8217;Reilly title <a href="http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716/ref=pd_bbs_sr_1?ie=UTF8&amp;tag=stylefeeder-20&amp;s=books&amp;qid=1216151187&amp;sr=8-1">High Performance MySQL</a> to try to track down the problem.  The book in turn recommends a couple of pretty cool monitoring/reporting tools that summarize a lot of the <a href="http://www.mysql.com">MySQL</a> variable displays in a more friendly format.</p>
<p><a href="http://sourceforge.net/projects/innotop">Innotop</a> (more info <a href="http://www.xaprb.com/blog/2006/07/02/innotop-mysql-innodb-monitor/">here</a>) is sort of like the friendly unix top command, but for database status.  There are pages to show buffer statuses, deadlocks, I/O status, current queries, and lots more.  They all update on screen at configurable intervals.</p>
<p><a href="http://hackmysql.com/">MySQLReport</a> from <a href="http://hackmysql.com/">hackmysql.com</a> runs a few status commands and presents the results on screen in a grouped format.  <a href="http://www.hackmysql.com/mysqlreportguide">This guide</a> summarizes the sections, which include more detail on many of the same things that innotop covers.  I find the sections on SELECT types and InnoDB buffer pool use especially useful.</p>
<p>Using the command type summary, we discovered an inordinate number of com_rollback calls in our main database, which we were able to reduce by using <a href="http://forums.mysql.com/read.php?39,200681,200696#msg-200696">this technique</a>.  The root cause was <a href="http://www.hibernate.org">Hibernate</a>&#8217;s love of transactions, combined with connection pooling.  A simple driver parameter seems to clear it up.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tech.stylefeeder.com/2008/07/15/mysql-analysis-tools/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Generating Primary Keys</title>
		<link>http://blog.tech.stylefeeder.com/2008/05/27/generating-primary-keys/</link>
		<comments>http://blog.tech.stylefeeder.com/2008/05/27/generating-primary-keys/#comments</comments>
		<pubDate>Tue, 27 May 2008 16:12:54 +0000</pubDate>
		<dc:creator>Jason Rennie</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[primary key]]></category>

		<guid isPermaLink="false">http://blog.tech.stylefeeder.com/?p=14</guid>
		<description><![CDATA[
A primary key for each row of a table in a database is virtually a requirement of database design.  Occasionally, the data for a table provides a primary key (e.g. username or email for an account table).  More common is that one needs to generate primary key values for a table.  Yet, [...]]]></description>
			<content:encoded><![CDATA[<p>
A primary key for each row of a table in a database is virtually a requirement of database design.  Occasionally, the data for a table provides a primary key (e.g. username or email for an account table).  More common is that one needs to generate primary key values for a table.  Yet, tools for this in MySQL/Java are limited.  MySQL offers auto_increment, but there are issues with replication, it can become a bottleneck for insert-heavy tables, it doesn&#8217;t provide globally unique ids and displaying these ids publicly may expose sensitive information.  Java offers java.util.UUID, which gives pseudo-random 128-bit values.  The chance of a collision is minuscule, but non-zero.  More troubling is the size of the string representation: 36 characters.  Since InnoDB uses the primary key index as the storage structure for the data and uses primary keys as data pointers for secondary indexes, long keys not only waste space, but make the database less efficient.
</p>
<p>
After evaluating these options and a few ideas of our own for primary key generation, we settled on a simple algorithm motivated by group theory.  The advantages of this algorithm are numerous:</p>
<ul>
<li>Short Keys (6 characters yield 57 billion unique keys using only alphanumeric characters)
<li>Universal Uniqueness (no guessing to which table a key value refers)
<li>Pseudo-randomness (keys don&#8217;t follow an obvious pattern)
<li>No Duplicate-Checking (keys are guaranteed to be unique until a limit is reached)
<li>Block Generation (keys are generated in blocks to minimize lock contention)
</ul>
<p>
Our generator uses one tiny bit of group theory: if k and n are coprime (aka relatively prime), the sequence of numbers generated by successively adding k (mod n) will not repeat through the first n values.  This leads to the following algorithm for generating unique keys:</p>
<ul>
<li>Pick a size n
<li>Pick a value k which is coprime with n
<li>To generate the next key: nextKey = (lastKey + k) % n
</ul>
<p>You&#8217;ll be guaranteed to not see duplicates until you&#8217;ve generated n keys.  The sequence you&#8217;d see with n=5 and k=3 is { 0, 3, 1, 4, 2, 0, 3, &#8230; }.
</p>
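<p>The steps above can be sketched in a few lines of Java.  This is an illustration rather than our production code: the class name KeySequence and its constructor arguments are hypothetical, and it assumes n is small enough that adding two values below n cannot overflow a long.</p>

```java
import java.math.BigInteger;

// Hypothetical illustration of the additive sequence nextKey = (lastKey + k) % n.
final class KeySequence {
    private final long n; // size of the key space
    private final long k; // step, reduced mod n; must be coprime with n
    private long last;    // most recently generated key

    KeySequence(long n, long k, long start) {
        if (n <= 0 || k <= 0) {
            throw new IllegalArgumentException("n and k must be positive");
        }
        // Without coprimality the sequence repeats before covering all n values.
        if (!BigInteger.valueOf(k).gcd(BigInteger.valueOf(n)).equals(BigInteger.ONE)) {
            throw new IllegalArgumentException("k must be coprime with n");
        }
        this.n = n;
        this.k = k % n;
        this.last = Math.floorMod(start, n);
    }

    // Returns the next key; no value repeats until n keys have been produced.
    long next() {
        last = (last + k) % n; // assumes n < 2^62 so the sum cannot overflow
        return last;
    }
}
```

<p>Starting from 0 with n=5 and k=3, successive calls to next() return 3, 1, 4, 2, 0, matching the sequence shown above.</p>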
<p>
Note that the choices of n and k are quite important&#8212;they must be fixed and can never change.  However, selecting reasonable values is not difficult.  For n, select a character set and string length, then set n to be the number of possible unique strings.  To get the 57 billion value above, use a string length of 6 and a character set of [0-9a-zA-Z] (62 characters).  57 billion is simply the number of unique, 6-character alphanumeric strings (62^6).  If you grow to the point that you are worried about key collisions, switch to using 7-character strings (where n=62^7, approx. 3.5 trillion).  Note that conversion from the key number value to string value is simply a conversion from base 10 to base 62 (or whatever # of characters you are using).
</p>
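<p>The number-to-string conversion might look like the sketch below.  The class name Base62, the digit order (0-9, then a-z, then A-Z) and the zero-padding to a fixed length are assumptions made for this example; any fixed ordering of the 62 characters works as long as it never changes.</p>

```java
// Hypothetical helper that renders a key number as a fixed-length base-62 string.
final class Base62 {
    // Digit order is an assumption: 0-9, then a-z, then A-Z.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes a non-negative value into exactly `length` characters, left-padded with '0'.
    static String encode(long value, int length) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (value % 62)); // least significant digit goes last
            value /= 62;
        }
        if (value != 0) {
            throw new IllegalArgumentException("value does not fit in " + length + " characters");
        }
        return new String(out);
    }
}
```

<p>With a length of 6, the largest encodable value is 62^6 - 1 = 56,800,235,583, which comes out as ZZZZZZ under this digit order.</p>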
<p>
For k, we need a value that is coprime with n.  To achieve pseudo-randomness, k should also not be too small (the same order as n is a good choice).  Note that this &#8220;randomness&#8221; is quite weak in a mathematical sense, but was sufficient for our purposes.  One way to select such a k is to multiply together prime numbers larger than the character set size.  For our example, a reasonable choice would be k=67*71*73*79*83*89.  If you don&#8217;t have your own prime number generator, consult <a href="http://alpha61.com/primenumbershittingbear/">the bear</a>.
</p>
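<p>A quick way to sanity-check a candidate k is to compute it and test coprimality directly.  The sketch below uses java.math.BigInteger; the class and method names are hypothetical.  Note that the example product is larger than 62^6, so the effective step is k mod n, which is coprime with n whenever k is.</p>

```java
import java.math.BigInteger;

// Hypothetical helper for checking a candidate step k against a key space n.
final class StepChoice {

    // Number of distinct strings of the given length over the given alphabet size.
    static BigInteger keySpace(int alphabetSize, int length) {
        return BigInteger.valueOf(alphabetSize).pow(length);
    }

    // Multiplies the supplied primes together to form a candidate k.
    static BigInteger productOf(long... primes) {
        BigInteger product = BigInteger.ONE;
        for (long p : primes) {
            product = product.multiply(BigInteger.valueOf(p));
        }
        return product;
    }

    // True when a and b share no common factor other than 1.
    static boolean coprime(BigInteger a, BigInteger b) {
        return a.gcd(b).equals(BigInteger.ONE);
    }
}
```

<p>For the example values, keySpace(62, 6) is 56,800,235,584 and productOf(67, 71, 73, 79, 83, 89) is 202,652,143,553; since 62 = 2 * 31 and none of the chosen primes is 2 or 31, coprime returns true.</p>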
<p>
To put this algorithm into practice, one needs to ensure that keys are generated serially.  We did this by creating a table with a single row with a single column storing the last key value.  When we want to generate a key (or block of keys), we start a SERIALIZABLE transaction, read the last key value, generate key(s) per the above algorithm, then write back the last key value we generated and close the transaction.  To minimize contention and since next key computation is much faster than a transaction, we generate keys in blocks and serve them out of memory via a synchronized HashMap.  This causes key values to occasionally be permanently lost when a webapp is shut down, but the lossage is too small to be of any real concern.
</p>
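<p>The block-generation idea can be sketched as follows.  This is a simplified illustration, not our actual code: LastKeyStore is a hypothetical interface standing in for the single-row table and its SERIALIZABLE transaction, InMemoryLastKeyStore is a stand-in for trying the logic without a database, and the sketch serves keys from a queue rather than a map.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical abstraction over the single-row "last key" table.
interface LastKeyStore {
    // Assumed to run in one SERIALIZABLE transaction: read the last key, advance it
    // by `count` steps of k (mod n), write it back, and return the value that was read.
    long advance(long k, long n, int count);
}

// In-memory stand-in so the sketch can run without a database.
final class InMemoryLastKeyStore implements LastKeyStore {
    private long last;

    InMemoryLastKeyStore(long start) { this.last = start; }

    @Override
    public synchronized long advance(long k, long n, int count) {
        long previous = last;
        for (int i = 0; i < count; i++) {
            last = (last + k) % n;
        }
        return previous;
    }
}

// Serves keys from memory and touches the store only once per block.
final class BlockKeyGenerator {
    private final LastKeyStore store;
    private final long n;
    private final long k;
    private final int blockSize;
    private final Deque<Long> pending = new ArrayDeque<>();

    BlockKeyGenerator(LastKeyStore store, long n, long k, int blockSize) {
        this.store = store;
        this.n = n;
        this.k = k;
        this.blockSize = blockSize;
    }

    synchronized long nextKey() {
        if (pending.isEmpty()) {
            // Reserve a whole block, then recompute its keys locally.
            long last = store.advance(k, n, blockSize);
            for (int i = 0; i < blockSize; i++) {
                last = (last + k) % n;
                pending.addLast(last);
            }
        }
        return pending.removeFirst();
    }
}
```

<p>Keys still sitting in the queue when the process exits are never handed out, which corresponds to the small, permanent loss of key values described above.</p>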
<p>
We&#8217;ve been using this system for many months now and have yet to run into any problems.  It satisfies all of our current needs and has the advantage that it can easily scale either by using longer character strings or increasing the key generation block size.  Furthermore, it seems to be extremely lightweight, exerting minimal pressure on our database.  We would love to hear what other solutions for primary key generation are used.  How does ours compare?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tech.stylefeeder.com/2008/05/27/generating-primary-keys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
