System Design Basics

‘So which database would you use- SQL or NoSql ?’

No matter which system you are asked to design in a system design interview, this question is going to pop up for sure.

Instead of just noting down what are the differences between them, let's first know how these databases work. After understanding their features and how they store data, one can easily understand which database is suitable for a given use case.

Watch this 21 min video to understand the following three concepts-

  • Distributed data stores
  • RDBMS scaling problems
  • CAP theorem

1. Cassandra

First, watch this video by Gaurav sen to understand basically ‘How NoSQL databases like Cassandra works’

So basically Cassandra is a key-value store where keys are hashed to find which node to store the key-value pair.

Following diagram shows basically how Cassandra stores data-

Records in Cassandra are stored as lists of key-value pairs where the column name is the key.

Understand the following features of Cassandra-

  • In Cassandra, writes are (almost) free-
  • Why?

    In relational databases, whenever a write operation has to be performed, locks are used to ensure consistency. While NoSQL databases give availability more priority than consistency. Hence whenever a write operation has to be performed, using consistent hashing the node in which the key-value pair has to be stored is determined.

    Owing to Cassandra’s architecture, writes are shockingly fast compared to relational databases. Write latency may be in the hundreds of microseconds and large production clusters can support millions of writes-per-second.

    But to make one thing clear, Cassandra can be configured to guarantee consistency. As discussed in the above Gaurav sen's video, Quorum is discussed which can be used to ensure consistency.

  • In Cassandra, denormalization is expected-
  • In RDBMS, it is encouraged to store data in a normalized form and further if required joins can be performed. But joins may increase latency to serve requests. Joins are not expected, rather are impossible in Nosql databases. Using more space is affordable than degrading user performance. Hence data is stored in a denormalized format in Cassandra or in any other NoSQL databases.

  • In Cassandra, store data such that no joins would be required!

To understand Cassandra better, we highly encourage you to read this white paper on ‘data modeling in Cassandra’. It also includes an example use case of video service, and how Cassandra is used to storing data. Understanding the above example will give a foundation on how engineers actually carry out data modeling.

Hence, Cassandra would be preferable when we want Availability and Partition tolerance.

As Cassandra represents the availability-partition tolerance side of the triangle, It won’t be a good solution in situations when consistency matters. For example, in any messaging service, or in google docs where synchronization is important, we must have a database which ensures consistency and partition tolerance.

Which database should we use to ensure Consistency and Partition tolerance?

Now one might come up with MySQL because it ensures consistency. But Relational databases are difficult to scale hence they don’t readily support Partition Tolerance.

Achieving scalability and elasticity is a huge challenge for relational databases. Relational databases were designed in a period when data could be kept small, neat, and orderly. That’s just not true anymore. Relational databases are designed to scale up on expensive single machines.

Relational databases provide solid, mature services according to the ACID properties. We get transaction-handling, efficient logging to enable recovery, etc. These are core services of the relational databases, and the ones that they are good at.

But relational databases are hard to customize and might be considered as a bottleneck, especially if you don't need them in a given application. Lots of "big data" problems don't require these strict constraints, for example, web analytics, web search, or processing moving object trajectories, as they already include uncertainty by nature.

When reaching the limits of a given computer (memory, CPU, disk: the data is too big, or data processing is too complex and costly), distributing the service is a good idea. Lots of relational and NoSQL databases offer distributed storage. In this case, however, ACID turns out to be difficult to satisfy: the CAP theorem states somewhat similar, that availability, consistency, and partition tolerance can not be achieved at the same time. If we give up ACID (satisfying BASE for example), scalability might be increased.

Now in order to scale relational databases, the following methods can be applied-

  • Master-slave partitioning where writes are committed to a single master node that distributes updated data to read-only slaves. It won't speed up writes which are the bottleneck usually.
  • Sharding where different data is stored on distinct nodes. In the above banking example, the first node may store accounts of people whose name starts with A-M, the other N-Z. It makes your app more complex by shifting the burden of scaling higher on the stack.

But the partitioning logic has to be handled at the Application layer. i.e. when application servers receive a request, the code in the application layer defines on which database server node the write or read operation has to be forwarded. But NoSQL databases are designed for scalability at large. Hence not a lot of effort is required to scale NoSQL databases.

Hence, MySQL would be preferable when we want Consistency and Availability.

But then a big question arises-

Facebook, youtube, stack overflow still use Mysql databases. Now platforms like these have billions of daily active users, why don’t they use NoSQL databases?

If you are interested to find out some real-world fact, we would strongly suggest reading the following articles in the given order-

According to article 1, Stonebraker says that ‘Not that it’s necessarily Facebook’s fault, though. The social network’s predicament is all too common among web startups that start small and grow to epic proportions’

Further after getting the response for the above, in article 2 Derrick Harissin says-

‘Facebook’s implementation is a sign of the times in which it was built, but the evidence suggests that if Facebook could do it over again with today’s database options, it wouldn’t go down the same path. Sharding MySQL thousands of times, operating thousands of Memcached servers, and paying a team of crack engineers to keep it scaling is nobody’s idea of fun.’

But why would companies like Facebook and Twitter still use the Mysql database? Article 3 gives a few reasons.

With all this discussion, it won't be wrong if you still suggest using the NoSQL database. Let’s say we would use Cassandra in our system. As we are using Cassandra, the indexes which we already discussed, would be used as keys for consistent hashing.

But if we were designing for let’s say a small user base, using Mysql won't be a problem. As partition tolerance won't be required and Mysql supports availability and consistency well.

Facebook’s MySQL team is supremely smart or that it does a great job innovating to ensure that the database is able to keep up with the site’s transactions. And as the MySQL community has developed more than the NoSQL community, companies still use MySQL databases instead of shifting to NoSQL databases.

But what if we want to develop an application that requires consistency and partition tolerance? Would we start with Mysql and then pay our engineers to make it scale?

Now we have another option which ensures both consistency and partition tolerance which is Hbase or Bigtable. If you are interested to know how HBase works - please read this article.

With HBase, you have a “query-first” schema design; all possible queries should be identified first, and the schema model designed accordingly. You should design your HBase schema to take advantage of the strengths of HBase.

Similar to Hbase, BigTable also ensures consistency and partition tolerance. HBase is a clone of BigTable. Their design philosophies are almost totally the same. Hbase and Bigtable are based on similar concepts with the following differences

  • HBase is open-sourced, BigTable is not.
  • BigTable is written with c++, HBase is written with Java.
  • Currently, BigTable has richer features than HBase.
  • BigTable supports transactions. Although HBase supports single row lock, transactions across multiple rows are not supported naturally currently.
  • BigTable's secondary index building is much more mature than HBase.

But the development speed of HBase is very fast. Many top developers are contributing to this project. So it is very possible HBase will catch up with or even surpass Bigtable in a few years.

Hence, HBase would be preferable when we want Consistency and Partition tolerance.

Tip: While designing, it is important to take into account to what extent do we want to scale our system. If the daily active user count is less, then relational databases are good because then partition tolerance won’t be a strong requirement and we can get the benefit of both consistency and availability. Hence the traffic and storage estimations part of the interview is important (step 3 in our System Design Examples) . These numbers actually help us to decide which databases would be good to go for the given scale.

Imp Interview Questions

  • What is NoSQL and Why Does it Exist?
  • SQL vs. NoSQL Scalability
  • When to use SQL instead of NoSQL
  • When to use NoSQL instead of SQL
  • When to Use Both SQL and NoSQL
  • Which database is right for your application and team?

Finally, read this article to understand the following concepts-