Database

Relational DB and SQL

Training

ISO standards of SQL

The term SQL was coined in 1983. It combines DML and DDL.

Transaction and ACID properties

A transaction is 'a transformation of state' that has the ACID properties (Atomicity, Consistency, Isolation, Durability), a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc.

Scalability and performance

Graph databases

Refer to semantic web.

Databases & Object Persistence

Distributed databases

Refer also to crypto-applications and blockchain.

CAP theorem

The CAP theorem (also named Brewer's theorem) states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: In particular, the CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.

No distributed system is safe from network failures, thus network partitioning generally has to be tolerated. In the presence of a partition, one is then left with two options: consistency or availability. When choosing consistency over availability, the system will return an error or a time-out if particular information cannot be guaranteed to be up to date due to network partitioning. When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning.

In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied.

CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability, whereas systems designed around the BASE (Basically Available, Soft state, Eventual consistency) philosophy, common in the NoSQL movement for example, choose availability over consistency.

Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Eventual consistency, also called optimistic replication, is widely deployed in distributed systems, and has origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee – most stronger models, like linearizability are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints. Eventually-consistent services are often classified as providing BASE semantics, in contrast to traditional ACID (Atomicity, Consistency, Isolation, Durability) guarantees.

The CAP theorem was formally proved to be true by Seth Gilbert and Nancy Lynch of MIT in 2002. In distributed systems, however, it is very likely that there will be network partitioning, and that at some point, machines will fail and cause others to become unreachable. Networking issues such as packet loss or high latency are nearly inevitable and have the potential to cause temporary partitions. This leads the conclusion that a distributed system must do its best to continue operating in the face of network partitions (i.e. be partition tolerant), leaving only two real options to compromise on: availability and consistency.

In 2012 Brewer provided an update of the CAP perspective, describing the 'pick two' axiom as somewhat misleading, due to avances in partition technology.

NoSQL

The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Good intro at http://www.nosql-database.org/.

Key-value stores, column stores, document stores, graph databases, object databases, XML databases, multi-model databases

Sharding and shared-nothing architecture

Sharding could be termed a kind of shared-nothing architecture that’s specific to databases. A shared-nothing architecture is one in which there is no centralized (shared) state, but each node in a distributed system is independent, so there is no client contention for shared resources. The term was first coined by Michael Stonebraker at the University of California at Berkeley in his 1986 paper “The Case for Shared Nothing.”

Shared-nothing architecture was popularised by Google's Bigtable database and its MapReduce implementation that do not share state, and are therefore capable of near-infinite scaling.

Lucene and Solr

Google's LevelDB

Apache CouchDB

Couch is an acronym for cluster of unreliable commodity hardware. It was first released in 2005 and became an Apache Software Foundation project in 2008.

Apache CouchDB is an open-source document-oriented NoSQL database, implemented in Erlang. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.

CouchDB differs from others by accepting eventual consistency, as opposed to putting absolute consistency ahead of raw availability, like RDBMS or Paxos.

Unlike a relational database, a CouchDB database does not store data and relationships in tables. Instead, each database is a collection of independent documents. Each document maintains its own data and self-contained schema. An application may access multiple databases, such as one stored on a user's mobile phone and another on a server. Document metadata contains revision information, making it possible to merge any differences that may have occurred while the databases were disconnected.

At the heart of CouchDB is a powerful B-tree storage engine. A B-tree is a sorted data structure that allows for searches, insertions, and deletions in logarithmic time. CouchDB uses MapReduce to compute the results of a view. MapReduce makes use of two functions, “map” and “reduce”, which are applied to each document in isolation. Being able to isolate these operations means that view computation lends itself to parallel and incremental computation. More important, because these functions produce key/value pairs, CouchDB is able to insert them into the B-tree storage engine, sorted by key.

Documents in CouchDB are versioned, and if you want to change a value in a document, you create a new version of that document so you end up with two versions of the same document, one old and one new. How does this offer an improvement over locks? Consider a set of requests wanting to access a document. The first request reads the document. While this is being processed, a second request changes the document. Since the second request includes a completely new version of the document, CouchDB can simply append it to the database without having to wait for the read request to finish. When a third request wants to read the same document, CouchDB will point it to the new version that has just been written. During this whole process, the first request could still be reading the original version. +++A read request will always see the most recent snapshot of your database at the time of the beginning of the request.+++

CouchDB implements a form of multiversion concurrency control (MVCC) so it does not lock the database file during writes. Conflicts are left to the application to resolve. Resolving a conflict generally involves first merging data into one of the documents, then deleting the stale one.

Other features include document-level ACID semantics with eventual consistency, (incremental) MapReduce, and (incremental) replication. One of CouchDB's distinguishing features is multi-master replication, which allows it to scale across machines to build high-performance systems.

What happens when you change the same document in two different databases and want to synchronize these with each other? CouchDB’s replication system comes with automatic conflict detection and resolution. When CouchDB detects that a document has been changed in both databases, it flags this document as being in conflict. When two versions of a document conflict during replication, the winning version is saved as the most recent version in the document’s history. Instead of throwing the losing version away, CouchDB saves this as a previous version in the document’s history, so that you can access it if you need to. This happens automatically and consistently, so both databases will make exactly the same choice. It is up to you to handle conflicts in a way that makes sense for your application. You can leave the chosen document versions in place, revert to the older version, or try to merge the two versions and save the result.

MongoDB

Apache Cassandra

Cassandra basics

The Cassandra database is a shared-nothing architecture, as it has no central controller and no notion of primary/secondary replicas; all of its nodes are the same. Its data model can be described as a partitioned row store, in which data is stored in sparse multidimensional hashtables. Somewhat confusingly, this type of data model is also frequently referred to as a wide column store.

'A Decentralized Structured Storage System' by Facebook’s Lakshman and Malik was a central paper on Cassandra. It was initially developed at Facebook to power the Facebook inbox search feature.

DataStax: in terms of the CAP theorem, Cassandra can be tuned to act more like a CP (consistent and partition tolerant) or AP (highly available and partition tolerant) system. Note that it is not possible to "tune" Cassandra into a completely CA system.

Many of the early production deployments of Cassandra involved storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. The ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.

Apache Cassandra

Cassandra technicalities

Further

MarkLogic

Cache management with atomicity: see Redis, https://en.wikipedia.org/wiki/Redis.

ElasticSearch

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java.

NewSQL

In 2012 two key papers were produced regarding new forms of SQL databases. These are the Calvin transaction protocol paper from Yale and the Google Spanner paper. FaunaDB is an example of the first, Google Cloud Spanner, Cockroach DB and YugaDB are examples of the second.