Database

Relational DB and SQL

Training

ISO standards of SQL

SQL grew out of IBM's SEQUEL language of the mid-1970s; it was first standardized by ANSI in 1986 and by ISO in 1987, with several revisions since. It combines a data definition language (DDL) and a data manipulation language (DML), among other sublanguages.
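
A minimal illustration of the two sublanguages, sketched with Python's built-in sqlite3 module (the table and values are invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # DDL: defines schema objects
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")

    # DML: reads and writes rows
    conn.execute("INSERT INTO account (owner, balance) VALUES (?, ?)", ("alice", 100.0))
    print(conn.execute("SELECT owner, balance FROM account").fetchall())

    conn.close()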

Transaction and ACID properties

A transaction is 'a transformation of state' that has the ACID properties (Atomicity, Consistency, Isolation, Durability), a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc.
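
A rough sketch of atomicity, again with Python's sqlite3 (the account table and amounts are invented for the example): a transfer either commits both updates or rolls back as a whole.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
    conn.commit()

    def transfer(amount):
        try:
            with conn:  # transaction: commits on success, rolls back on any exception
                conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
                conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
        except sqlite3.IntegrityError:
            pass  # CHECK constraint fired: the credit above is rolled back as well

    transfer(500.0)  # would overdraw account 1, so neither update survives
    print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
    # -> [(1, 100.0), (2, 50.0)]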

Scalability and performance

Graph databases

Refer to semantic web.

Databases & Object Persistence

Distributed databases

Refer also to crypto-applications and blockchain.

CAP theorem

The CAP theorem (also named Brewer's theorem) states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: consistency, availability, and partition tolerance. In particular, the CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.

No distributed system is safe from network failures, thus network partitioning generally has to be tolerated. In the presence of a partition, one is then left with two options: consistency or availability. When choosing consistency over availability, the system will return an error or a time-out if particular information cannot be guaranteed to be up to date due to network partitioning. When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.

In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied.

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.
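
A toy sketch of the choice a partition forces (plain Python, not any real database's API): a "CP" replica refuses reads it cannot prove are current, while an "AP" replica answers from possibly stale local state.

    class Replica:
        def __init__(self, mode):
            self.mode = mode          # "CP" or "AP"
            self.local = {"x": 1}     # last value this replica has seen
            self.partitioned = False  # True when it cannot reach the other replicas

        def read(self, key):
            if self.partitioned:
                if self.mode == "CP":
                    raise TimeoutError("cannot confirm latest value during partition")
                return self.local[key]  # AP: available, but possibly stale
            return self.local[key]      # no partition: consistent and available

    r = Replica("AP")
    r.partitioned = True
    print(r.read("x"))  # 1, even though a newer value may exist elsewhere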

Database systems designed with traditional ACID guarantees in mind, such as RDBMSs, choose consistency over availability, whereas systems designed around the BASE philosophy (Basically Available, Soft state, Eventual consistency), common in the NoSQL movement, choose availability over consistency.

Eventual consistency is a consistency model used in distributed computing to achieve high availability; it informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventual consistency, also called optimistic replication, is widely deployed in distributed systems and has its origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee: most stronger models, such as linearizability, are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints. Eventually consistent services are often classified as providing BASE semantics, in contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.
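
A small sketch of one way replicas can converge, using a last-writer-wins rule (a deliberately simple reconciliation strategy; real systems may use vector clocks or CRDTs instead):

    # each replica keeps (value, timestamp); on anti-entropy sync the newer write wins
    def merge(a, b):
        return a if a[1] >= b[1] else b

    replica_1 = {"x": ("old", 1)}
    replica_2 = {"x": ("new", 2)}   # accepted a later write during the partition

    # once the partition heals and no new updates arrive, both replicas converge
    merged = merge(replica_1["x"], replica_2["x"])
    replica_1["x"] = replica_2["x"] = merged
    print(replica_1["x"] == replica_2["x"])  # True: replica convergence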

The CAP theorem was formally proven by Seth Gilbert and Nancy Lynch of MIT in 2002. In distributed systems, however, it is very likely that there will be network partitioning: at some point machines will fail and cause others to become unreachable, and networking issues such as packet loss or high latency are nearly inevitable and can cause temporary partitions. This leads to the conclusion that a distributed system must do its best to continue operating in the face of network partitions (i.e. be partition tolerant), leaving only two real options to compromise on: availability and consistency.

In 2012 Brewer provided an update of the CAP perspective, describing the 'pick two out of three' formulation as somewhat misleading, due to advances in partition handling techniques.

NoSQL

The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Good intro at http://www.nosql-database.org/.

Key-value stores, column stores, document stores, graph databases, object databases, XML databases, multi-model databases
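
A rough sketch, in plain Python data structures rather than any particular product, of how the same user record might look under a few of these models:

    # Key-value store: the value is opaque to the database, looked up only by key.
    kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}

    # Document store: the value is a structured document the database can index and query.
    doc_store = {"user:42": {"name": "Ada", "city": "London", "tags": ["admin"]}}

    # Wide column store: rows are grouped under a partition key, and columns are sparse.
    wide_column = {"users": {"42": {"name": "Ada", "city": "London"}}}

    print(doc_store["user:42"]["name"])  # document stores can address fields directly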

Sharding and shared-nothing architecture

Sharding could be termed a kind of shared-nothing architecture that’s specific to databases. A shared-nothing architecture is one in which there is no centralized (shared) state: each node in a distributed system is independent, so there is no client contention for shared resources. The term was coined by Michael Stonebraker of the University of California, Berkeley in his 1986 paper “The Case for Shared Nothing.”

The shared-nothing architecture was popularised by Google's Bigtable database and MapReduce implementation, which share no state between nodes and are therefore capable of scaling horizontally to very large clusters.
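
A minimal sketch of hash-based sharding (node names and key are hypothetical): each node owns a subset of the keys, and a client can route a key to its shard without any shared coordinator.

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]   # hypothetical shard nodes

    def shard_for(key: str) -> str:
        # hash the key and map it onto one of the nodes; no central lookup needed
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    print(shard_for("user:42"))  # the same key always routes to the same node

Note that simple modulo hashing reshuffles most keys when nodes are added or removed; real systems typically use consistent hashing instead.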

Lucene and Solr

Google's LevelDB

Apache CouchDB

MongoDB

Apache Cassandra

Cassandra basics

The Cassandra database has a shared-nothing architecture: it has no central controller and no notion of primary/secondary replicas; all of its nodes are equal. Its data model can be described as a partitioned row store, in which data is stored in sparse multidimensional hashtables. Somewhat confusingly, this type of data model is also frequently referred to as a wide column store.
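
A rough, illustrative sketch of the partitioned row store idea in plain Python (names are invented, and this is not CQL): a partition key selects a bucket of rows, and within a partition each row is a sparse map of columns kept in clustering-key order.

    from collections import OrderedDict

    # partition key -> clustering key -> {column: value}; columns are sparse
    table = {
        "user:42": OrderedDict([
            ("2023-01-01", {"event": "login", "ip": "10.0.0.1"}),
            ("2023-01-02", {"event": "purchase"}),   # no "ip" column in this row
        ]),
    }

    # a read addresses a partition, then iterates rows in clustering-key order
    for clustering_key, columns in table["user:42"].items():
        print(clustering_key, columns)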

'Cassandra: A Decentralized Structured Storage System' by Facebook’s Lakshman and Malik is the central paper on Cassandra, which was initially developed at Facebook to power the inbox search feature.

DataStax: in terms of the CAP theorem, Cassandra can be tuned to act more like a CP (consistent and partition tolerant) or AP (highly available and partition tolerant) system. Note that it is not possible to "tune" Cassandra into a completely CA system.
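
The rule behind this tuning is the usual quorum overlap argument: with replication factor N, a write acknowledged by W replicas and a read contacting R replicas are guaranteed to overlap on at least one up-to-date replica whenever R + W > N. A tiny sketch:

    def overlapping(n: int, r: int, w: int) -> bool:
        # the read set must intersect the write set when R + W > N
        return r + w > n

    N = 3
    print(overlapping(N, r=2, w=2))  # QUORUM reads + QUORUM writes: True, behaves more CP-like
    print(overlapping(N, r=1, w=1))  # ONE/ONE: False, more AP-like (eventually consistent)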

Many of the early production deployments of Cassandra involved storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve many writes with less predictable read patterns, and because updates can occur unevenly, with sudden spikes. The ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.


Cassandra technicalities

Further

MarkLogic

Cache management with atomicity: see Redis, https://en.wikipedia.org/wiki/Redis.
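
A hedged sketch with the redis-py client (assumes a Redis server on localhost; key and value are invented): SET with NX and EX is a single atomic command, so a cache entry can be filled without a separate read-then-write race.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # set the cache entry only if it does not already exist, with a 60 s TTL
    was_set = r.set("cache:user:42", '{"name": "Ada"}', nx=True, ex=60)
    print(was_set)                 # True on the first call, None if the key already existed
    print(r.get("cache:user:42"))  # b'{"name": "Ada"}'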

NewSQL

In 2012 two key papers appeared regarding new forms of distributed SQL databases: the Calvin transaction protocol paper from Yale and the Google Spanner paper. FaunaDB is an example of the first approach; Google Cloud Spanner, CockroachDB, and YugabyteDB are examples of the second.