Why NoSQL databases came into being and why are they gaining massive popularity?
Databases have always been integral part of computer systems. Some of the initial enterprise grade applications were basic CRUD(Create, Read, Update and Delete) applications which existed with databases. They were used to store data in an format in persistent manner which can be operated only with the application written on top of it. Then came SQL, a query language which can be used generically with most of the applications. Practices around Normalization also were being integrated into this model of relational databases. Normalization means reducing data redundancy and enhancing data integrity by satisfying some predefined conditions. This model worked perfectly fine in era of static web applications due to reliable nature and features such as ACID(Atomicity, Consistency, Integrity, Durability) transactions.
As the demands from computing systems evolved, as new use-cases such as social media, e-commerce, video streaming etc. came along, limitations of SQL databases to deal with massive scale came to forefront. Different NoSQL databases started emerging. There’re several factors contributing to this change. Application developers were restricted with SQL and its strict schema pattern with little to no room to innovate or improvise as per necessity. One of the reason of popularity of SQL was cost of storage, as single data point is stored exactly once according to normalization. But as hardware also evolved a lot, cost of storage is not a factor anymore as it is really cheap. Complex table joins on database servers started to impact the overall performance in case of SQL relational model. Then arrived NoSQL databases. NoSQL stands for Not Only SQL, which is a terrible acronym but it serves the purpose. NoSQL became an umbrella term for all the databases which broke loose from restriction of SQL.
NoSQL databases were designed for scale with denormalized data model. It took away the necessity of performing complex table joins as data separated in multiple tables was consolidated in the collections, which were schemaless(well not technically, schema is there but it’s high modifiable). These schemaless designs depend upon access patterns of the application for which the databases is required as well as the storage patterns. These gave rise to terminologies such as schema-on-read and schema-on-write respectively. They generally have a common attribute which can be indexed to avoid full table scans and in turn improve latencies.
As volume, variety and velocity of data increased even new database models were invented to accommodate specific use-cases to operate with the data to try and extract value out of it. It became de-facto choice for OLAP(Online Analytical Processing), as opposed to traditional OLTP (Online Transaction Processing) where SQL is still a default choice for transactional workloads. Once you’re done with basic modelling and setting up indexes, you can leverage the flexibility of NoSQL to innovate or improve afterwards if use-case changes or you need to add/remove a column etc. However understanding access patterns for required use-case is crucial, as there are many choices to choose from unlike SQL model. During studying Database Management Systems, we’re being taught to think of databases in relation models. Similarly many legacy application databases have not so obvious access patterns which involve deeply intricate table joins. That’s why NoSQL feels bit out of sort, but it shouldn’t be like that if you start implement first principle thinking and try to understand those access patterns.
Most applications are built by abstracting one data model on top of another. Each layer hides the complexity of the layers below by providing a different data model. These abstractions allow different groups of people and different software applications to work effectively.
The roots of relational databases lie in business data processing, transaction processing and batch processing. The goal was to hide the implementation details behind a cleaner interface.
Not Only SQL has a few driving forces:
- Greater scalability
- preference for free and open source software
- Specialised query optimisations
- Desire for a more dynamic and expressive data model
With a SQL model, if data is stored in a relational tables, an awkward translation layer is translated, this is called impedance mismatch. JSON model reduces the impedance mismatch and the lack of schema is often cited as an advantage. JSON representation has better locality than the multi-table SQL schema. All the relevant information is in one place, and one query is sufficient.
In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak. If the database itself does not support joins, you have to emulate a join in application code by making multiple queries. The most popular database for business data processing in the 1970s was the IBM’s Information Management System (IMS). IMS used a hierarchical model and like document databases worked well for one-to-many relationships, but it made many-to-,any relationships difficult, and it didn’t support joins.
The main arguments in favour of the document data model are schema flexibility, better performance due to locality, and sometimes closer data structures to the ones used by the applications. The relation model counters by providing better support for joins, and many-to-one and many-to-many relationships. If the data in your application has a document-like structure, then it’s probably a good idea to use a document model. The relational technique of shredding can lead unnecessary complicated application code. The poor support for joins in document databases may or may not be a problem. If you application does use many-to-many relationships, the document model becomes less appealing. Joins can be emulated in application code by making multiple requests. Using the document model can lead to significantly more complex application code and worse performance.
Most document databases do not enforce any schema on the data in documents. Arbitrary keys and values can be added to a document, when reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called schemaless, but maybe a more appropriate term is schema-on-read, in contrast to schema-on-write. Schema-on-read is similar to dynamic (runtime) type checking, whereas schema-on-write is similar to static (compile-time) type checking. The schema-on-read approach if the items on the collection don’t have all the same structure (heterogeneous).
If your application often needs to access the entire document, there is a performance advantage to this storage locality. The database typically needs to load the entire document, even if you access only a small portion of it. On updates, the entire document usually needs to be rewritten, it is recommended that you keep documents fairly small.
PostgreSQL does support JSON documents. RethinkDB supports relational-like joins in its query language and some MongoDB drivers automatically resolve database references. Relational and document databases are becoming more similar over time.
There are potentially large number to NoSQL database types, just due to sheer flexibility they provide. Here I’m listing down few of the prominent ones.
- Document databases: Store information as documents (in formats such as JSON and XML). For example: MongoDB, Amazon DocumentDB etc.
- Key-value stores: Group associated data in collections with records that are identified with unique keys for easy retrieval. Key-value stores have just enough structure to mirror the value of relational databases while still preserving the benefits of NoSQL. For example: Bigtable, DynamoDB etc.
- In-memory database: Purpose-built database that relies primarily on memory for data storage. These are designed to attain minimal response time by eliminating the need to access disks. They are ideal for applications that require microsecond response times and can have large spikes in traffic. For example: redis, memcahced etc.
- Wide-column databases: Use the tabular format but allow a wide variance in how data is named and formatted in each row, even in the same table. They have some basic structure while preserving a lot of flexibility. For example: Cassandra, HBase, Google BigTable etc.
- Graph databases: Use graph structures to define the relationships between stored data points; useful for identifying patterns in unstructured and semi-structured information. For example: Neo4j, Amazon Neptune etc.