Distributed Systems 101
What are distributed systems?
A distributed system is one where two or more machines work coherently to get a task done. In these systems, nodes (the term used for infrastructure components in distributed-systems lingo) communicate over a network. The ideal distributed system is one where the nodes and the network are 100% reliable, but the real world is far from ideal, the reason being the high level of uncertainty and non-determinism involved in building, operating and maintaining distributed systems.
Why do we need them? (i.e. Pros)
Parallelism : If the task at hand is so huge that it can't be completed on a single machine, the workload needs to be distributed among multiple machines which can work on it in parallel. Think of large-scale machine learning problems where enormous data crunching is required; trying to do so on a single machine can consume an awful lot of time. The solution is to parallelize the workload among multiple machines which communicate with each other and pass the crunched data on to the next stage over the network. Another example of this scenario is a huge database which won't fit on a single machine, where the idea is to partition the data among multiple nodes based on some criterion such as the primary key or the location of the nodes.
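The split-then-work-in-parallel idea can be sketched on a single machine. This is only an illustration: worker threads stand in for the nodes that would, in a real distributed system, receive their chunks over the network; the function names are made up for the example.

```python
# A minimal, single-machine sketch of splitting a workload into chunks
# and processing them in parallel. In a real distributed system each
# chunk would be shipped to a different node over the network; here
# worker threads merely stand in for those nodes.
from concurrent.futures import ThreadPoolExecutor

def crunch(chunk):
    # Placeholder for an expensive per-chunk computation.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Partition the data into roughly equal chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(crunch, chunks))

print(parallel_sum_of_squares(list(range(1000))))
```

The key property is that each chunk can be processed independently, so adding more workers (or nodes) shortens the wall-clock time without changing the result.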
Inherent necessity : Some systems are inherently distributed. For example, any messaging or media-sharing application is inherently distributed: you need multiple machines communicating over a network for the system to function. The sender or uploader and the receiver or viewer are using different devices, and the messages or media are delivered via the network.
Fault Tolerance : Fault tolerance is the process of identifying and mitigating faults so that the entire system does not fail. Centralized compute and storage systems can be rendered useless if a fault occurs on the one machine and hampers its working. With distributed systems, there is always some sort of backup, so that even if a machine fails, the entire system won't be affected. "Fault" is a very wide term, used for any human-caused (bug, virus etc.) or natural (earthquake etc.) incident that can have an adverse effect on the system.
Availability : Availability is the ability of a system to stay up and serve incoming requests. It is usually expressed as a percentage of uptime, often counted in "nines" (99% availability means the system may be unavailable for roughly 3.65 days in a year, and so on). Higher availability can be achieved with higher fault tolerance, meaning that the failure of one (or more) nodes won't cause the entire system to fail and stop serving incoming requests.
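The arithmetic behind these "nines" is simple enough to compute directly; a quick sketch:

```python
# Allowed downtime per year for a given availability percentage.
def downtime_hours_per_year(availability_pct):
    hours_per_year = 365 * 24  # ignoring leap years for simplicity
    return hours_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability -> {downtime_hours_per_year(pct):.2f} hours down/year")
```

So 99% availability allows about 87.6 hours (roughly 3.65 days) of downtime per year, while 99.99% allows under an hour, which is why each extra nine gets dramatically harder to achieve.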
Performance at scale : A system is said to be scalable if its performance doesn't drop whether it is serving 10 requests or 10 million. To achieve performance at such scale, the system has to be distributed so that the workload can be shared among multiple machines. Another plus point of distributed systems from a performance standpoint is that they can be distributed geographically. For example, a user in India communicating with a server located in India will see better performance than with a server located in Europe (because of the lower network round-trip time).
Challenges (i.e. Cons) :
Network failure : Computer networks play an important role in distributed systems, as they facilitate the communication and information transfer that happens between two or more nodes. Networks are not 100% reliable, though their quality has improved significantly over time. Network failures can be caused by human errors (e.g. misconfiguration of devices) or natural factors (e.g. a shark biting an underwater optic fibre cable).
Node failure : Node failure is another important challenge that needs to be considered when building distributed systems. Node failure encompasses a wide variety of scenarios, ranging from hardware failure to a bug in the software to possible human error. Any scenario which causes a node to stop operating correctly can be considered a node failure. Think of a hardware failure because the hardware is outdated, or a process failure because some bug in the code went unnoticed and caused a deadlock.
Non-deterministic nature of failure : The tricky part of diagnosing failures in a distributed system is their non-deterministic nature. A failure can be partial or complete, and we have no way of determining whether it is partial (so that we can wait for the component to become functional again) or complete (so that we can move on and initiate the next course of action) before some, perhaps well-defined, time limit. Consider a network failure where your message is just not getting through to the receiver. It is possible that the entire network link has failed, or that the traffic on a particular connection is too high, causing congestion (in which case your message will eventually reach the receiver). In this scenario it is very difficult to determine the nature of the failure.
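In practice the standard (and imperfect) workaround is a timeout. A sketch of the idea, with an illustrative `probe` function that is not any real library's API:

```python
# A timeout cannot distinguish a crashed node from a slow network;
# it can only let us *suspect* failure, never prove it.
import socket

def probe(host, port, timeout_s=1.0):
    """Return True if the node answered within the timeout.

    A False result is ambiguous: the node may be down, the network
    may be partitioned, or the reply may simply be slower than the
    timeout allows.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

This ambiguity is exactly why failure detectors in real systems are described as "suspecting" nodes rather than "knowing" they have failed.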
Clock Synchronization : Clocks in computing systems are usually quartz crystal oscillators which vibrate at a certain frequency, which can then be interpreted as time (in terms of seconds, minutes, hours etc.). While these oscillations stay fairly constant at room temperature (resulting in an accurate interpretation of time), they tend to get faster or slower as the temperature rises or falls, which is likely to happen in heated server rooms. This drift can skew the interpretation of time, and processes that are a function of time in any way can then cause inconsistencies in distributed systems. Think of a logging system or a messaging system: incorrect timings (even at the scale of seconds) can potentially have huge real-world impacts.
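One practical consequence: because the wall clock can be adjusted (for example by clock-synchronization daemons) while code is running, durations should be measured with a monotonic clock. A small sketch:

```python
# time.time() reads the wall clock, which can jump backwards or
# forwards if the clock is resynchronized mid-measurement.
# time.monotonic() only ever moves forward, so it is the safe
# choice for measuring elapsed time.
import time

def timed(fn):
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed

result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"took {elapsed:.4f}s")
```

The same caution applies across machines: two nodes' wall clocks generally disagree, so ordering events by timestamp alone is unreliable.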
Consistency : Working with consistency levels is a vital part of designing distributed systems. Consistency levels range from stronger ones (where data is immediately consistent across nodes) to weaker ones (where it is not), and the right level depends on the use case (for example, banking systems have to be immediately consistent, while social media doesn't). Stronger consistency models are expensive (in terms of CPU, memory and bandwidth usage), as we cannot declare a task complete before we receive acknowledgements from all the relevant nodes in the system. Studying and implementing the trade-offs around consistency is perhaps the most interesting (or dreaded) part of distributed system design.
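One common way this trade-off is tuned is quorum replication: with N replicas, a write waits for W acknowledgements and a read queries R replicas. If W + R > N, every read set overlaps every write set, so a read is guaranteed to see the latest acknowledged write. This is a sketch of the rule itself, not any particular database's API:

```python
# Quorum rule: with N replicas, W write-acks and R read-replies,
# reads are guaranteed to overlap the latest write iff W + R > N.
def read_sees_latest_write(n, w, r):
    return w + r > n

# Strong setting: any read overlaps the last write.
print(read_sees_latest_write(n=3, w=2, r=2))  # True
# Weak but fast setting: a read may miss the latest write.
print(read_sees_latest_write(n=3, w=1, r=1))  # False
```

Lowering W or R makes writes or reads faster and more available (fewer nodes must respond) at the cost of possibly stale reads, which is the consistency/performance trade-off described above in miniature.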
Infrastructure components in distributed systems :
Storage : Storage systems are used to store data (obviously); they vary by volatility (persistent or volatile), storage format (blob stores, databases) etc. The primary function of these infrastructure components is to retain information which can later be consumed by some task. There can be various reasons to distribute storage across multiple nodes, as discussed earlier; the sheer volume of data being too large for a single node to store is one of them. Consider a database with 100 billion rows which can't be stored on one node: partitioning it across 4 nodes with 25 billion rows each makes sense.
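A common way to do such partitioning is to hash the primary key and assign each row to a node by the hash value. A minimal sketch, with an illustrative node list:

```python
# Partitioning rows across nodes by primary key. The node names and
# scheme here are illustrative assumptions, not a real system's layout.
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for_key(primary_key, nodes=NODES):
    # Hash the key so rows spread evenly. Use a stable hash (not
    # Python's built-in hash(), which varies between processes).
    digest = hashlib.sha256(str(primary_key).encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Every row with the same key always lands on the same node.
print(node_for_key(42) == node_for_key(42))  # True
```

A simple modulo scheme like this reshuffles most keys when the node count changes, which is why production systems often prefer consistent hashing; the core idea of key-based placement is the same.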
Compute : Compute systems are where the actual computation takes place (e.g. bare-metal CPUs, containers etc.). The MapReduce framework is a well-known implementation of distributed computing, which came into prevalence mainly to perform computations on huge-scale data. The idea is to split the computing workload among multiple machines, which then work in parallel to complete the task. Any system which serves an enormous amount of traffic (requests from clients) is most probably using distributed compute machines. For example, consider a website serving 20 million requests per second. It's highly unlikely that one machine can serve all those requests; the more likely scenario is that multiple compute instances sit behind a load balancer which distributes the incoming traffic among them.
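The MapReduce pattern can be illustrated with the canonical word-count example. This is a toy, single-process sketch: in a real cluster the map and reduce phases run on many machines, with the shuffle moving data between them over the network.

```python
# Toy MapReduce: count words across documents.
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) for every word in a document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted counts by key (word).
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
```

Because each map call touches only one document and each reduce call only one word's counts, both phases parallelize naturally across machines.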
Networking : Networking machines take care of whatever communication happens among the nodes. Load balancers, proxies, gateways, routers etc. are some examples of networking devices that may come into play in distributed systems; their function is to facilitate communication. Think of the entire path a web request might follow: the home router, then switches, then optic fibre cables, then reverse proxies at the server end, then the load balancer which distributes the workload among compute devices. All of these come under networking infrastructure in distributed systems.
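At its core, the load balancer mentioned above just decides which backend handles the next request. A minimal round-robin sketch, with made-up backend names:

```python
# Round-robin load balancing: hand each incoming request to the
# next backend in turn. Backend names here are illustrative.
import itertools

class RoundRobinBalancer:
    def __init__(self, backends):
        self._backends = itertools.cycle(backends)

    def pick(self):
        # Return the backend that should handle the next request.
        return next(self._backends)

lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([lb.pick() for _ in range(4)])  # ['web-1', 'web-2', 'web-3', 'web-1']
```

Real load balancers layer health checks, weighting and session affinity on top, but the routing decision in this sketch is the essence of the job.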