
Part I - Foundations of Data Systems
Premise: modern applications tend to be data-intensive rather than compute-intensive.
Chapter 1: Reliable, scalable and maintainable applications
A few tools mentioned:
- in-memory data store Redis
- in-memory caching layer Memcached
- Full-text search servers e.g. Elasticsearch or Solr
Modern data tools often serve multiple overlapping use cases: for example, Redis can function as a message broker, while Apache Kafka offers database-like durability guarantees. At the same time, modern applications tend to have so many requirements that multiple tools need to be architected together.
This chapter covers three key variables that data system design often optimizes for: reliability, scalability, and maintainability.
1. Reliability
A reliable system keeps working correctly even when things go wrong. A single component deviating from its spec is called a fault, while a failure means the system as a whole stops providing its service. Faults can be made less likely, but they are inevitable, so designing for fault-tolerance is generally more realistic than aiming for zero faults. The exception is faults that cannot be tolerated (e.g. a security breach). The book focuses on tolerable faults.
Testing fault-tolerance with chaos engineering: Netflix Chaos Monkey, then Simian Army.
Hardware faults (MTTF) vs. Software faults vs. Human errors
Prefer software fault-tolerance over hardware redundancy. Hardware redundancy alone does not scale with today's data volumes and computing demands. Bonus: software fault-tolerance also helps with maintainability, e.g. no need for planned downtime. Also, hardware faults are generally less correlated than software bugs: see the Linux kernel bug triggered by the leap second of June 30, 2012.
2. Scalability
The scalability of a system is its ability to cope with increased load.
Load
Load can be evaluated with respect to load parameters:
- requests/s (web server)
- reads/writes ratio (database)
- active users/s (chat room)
- Hit rate (cache)
- …
Twitter (2012)
Post tweet: 5k requests/s on average, max 12k requests/s
Home timeline reads: 300k requests/s
Writing a tweet is an ’easy’ operation, but the fan-out is problematic. By fan-out, the author means the one-to-many relationship between a user and their followers: posting one tweet also means making it available to all of the sender’s followers. Twitter had two solutions at the time (eventually combined):
- Just insert the tweet into the central collection of tweets; fetch it when a follower of the sender requests their home timeline.
- Write the tweet in the home timeline cache of each user following the sender
The first approach is cheaper for posting tweets, but more expensive for home timeline reads. Since there are far more reads than writes (almost two orders of magnitude), it turned out not to be sustainable for Twitter. They switched to approach 2, then to a mix: approach 2 is the general rule, except for users with a very large number of followers, whose tweets are fetched separately at read time and merged into the timeline.
In this case, the key load parameters are: writes/s (post), reads/s (home timeline) and number of followers/user.
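As a rough illustration of the two approaches (the in-memory stores tweets, followers and timeline_cache are made up for this sketch):

from collections import defaultdict

# Hypothetical in-memory stores, for illustration only.
tweets = []                         # global collection of (user, text)
followers = defaultdict(set)        # user -> set of that user's followers
timeline_cache = defaultdict(list)  # user -> precomputed home timeline

def post_tweet_v1(user, text):
    # Approach 1: cheap write, expensive read.
    tweets.append((user, text))

def home_timeline_v1(following):
    # Scan the global collection for tweets posted by followed users.
    return [(u, t) for (u, t) in tweets if u in following]

def post_tweet_v2(user, text):
    # Approach 2 (fan-out on write): push the tweet into every follower's cache.
    for follower in followers[user]:
        timeline_cache[follower].append((user, text))

def home_timeline_v2(reader):
    # Reading the home timeline becomes a single cache lookup.
    return timeline_cache[reader]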
Performance
Two main ways to describe scalability with regard to load:
- Fixed resources: when a load parameter increases and resources stay the same, how is performance affected?
- Fixed performance: when a load parameter increases, how much do resources need to grow to keep performance unchanged?
What is performance?
- Throughput, e.g. batch processing systems
- Response time, e.g. online systems
Note: response time = latency + service time, even though latency and response time are often used synonymously.
The average is often used, but of course percentiles or the whole distribution provide more information.
Example: Amazon describes some response time requirements in terms of p999 (the 99.9th percentile)
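Percentiles are easy to derive from a sample of measured response times; a minimal Python sketch (nearest-rank method, sample values made up):

def percentile(response_times_ms, p):
    # Return the p-th percentile (0 < p <= 100) of a list of response times.
    ordered = sorted(response_times_ms)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

samples = [12, 15, 14, 110, 13, 16, 18, 2500, 17, 14]
print(percentile(samples, 50))    # p50 (median): 15
print(percentile(samples, 99.9))  # p999 (tail latency): 2500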
Coping with load
- Vertical scaling (scaling up) vs. horizontal scaling (scaling out)
- Elastic systems: systems that can automatically add resources when load increases
- Distributing stateless services is easier than distributing stateful ones
- Scaling architectures are highly application-specific: 100,000 requests/s of 1 kB each and 3 requests/min of 2 GB each represent the same throughput, but call for very different designs
3. Maintainability
Operability / Simplicity / Evolvability
Chapter 2: Data models & Query languages
Data models used to store data structures: relational model, document model, graph-based models…
Relational model
Relations and tuples, i.e. tables and rows in the SQL world.
Needs that drove the adoption of alternative (NoSQL) models:
- Greater freedom in terms of data structures and queries, see impedance mismatch
- Greater scalability
- Availability of free and open-source tools
Object-relational impedance mismatch
Refers to the relative incompatibility between objects (OOP) and relational tables. Object-relational mapping (ORM) frameworks are tools designed to alleviate this mismatch, e.g. Hibernate or ActiveRecord.
For example, relational databases are not the most natural fit for representing one-to-many relationships: in a résumé object, there is one first_name and one last_name, but possibly many jobs or education entries. There are several ways to represent one-to-many relationships:
- Store a field (e.g. jobs) in its own normalized table, with a foreign key like user_id
- Use multi-valued types like a structured type (fixed number of values) or a JSON datatype (e.g. MySQL, PostgreSQL)
- Use a text field and store a JSON string
- Use a document-oriented database
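As a toy illustration of the document approach, loosely following the book's résumé example (field names are indicative):

# A self-contained résumé document: the one-to-many job and education
# entries are nested inside the parent record instead of living in
# separate normalized tables.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}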
Document databases
E.g. MongoDB, Espresso. A JSON document has better locality than a multi-table relational schema: no need to perform several queries or several joins to fetch the content of one document (e.g. the résumé of a given user).
Relational vs. document databases
- one-to-many relationships: the approach is significantly different
- Relational: the many entities are stored in a separate table
- Document: the many entities are stored within the parent record
- Many-to-one and many-to-many: conceptually similar approach as foreign keys correspond to document references
Advantages of document data model (not databases)
- Schema flexibility (schema-on-read vs. schema-on-write)
- Locality: the data is stored in one string, not on multiple tables. Some relational databases implement locality e.g. Google Spanner
- Data structures more in line with applications (e.g. objects)
Advantages of relational data model (not databases)
- Joins: better support
- Many-to-one and many-to-many: better support
Query languages
Declarative vs. imperative
SQL is a declarative language, which has advantages over imperative ones:
- Concise
- Easier to work with, as the implementation details are hidden inside the underlying engine
- Efficient: the underlying engine can be improved without changing the high-level queries
- Well suited to parallel execution
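To make the contrast concrete, a small sketch using the book's shark example: the same filter written imperatively in Python and declaratively as a SQL string (table and column names are illustrative):

# Imperative: we spell out *how* to find the sharks, step by step.
def sharks_imperative(animals):
    sharks = []
    for animal in animals:
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative: we only state *what* we want; the query engine decides how
# to execute it (index choice, ordering, parallelization, ...).
SHARKS_DECLARATIVE = "SELECT * FROM animals WHERE family = 'Sharks';"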
MapReduce
Some document databases (e.g. MongoDB) offer MapReduce-based querying, which sits halfway between declarative and imperative: the map and reduce functions are snippets of code, but they must be pure functions and the framework decides how and when to run them.
Graph-like Data Models
Use case: lots of many-to-many relationships in the data, possibly heterogeneous. Neo4j and Titan are graph databases; Cypher is a declarative graph query language (created for Neo4j).
A significant advantage of graph query languages: unknown numbers of vertices can be traversed easily (e.g. :WITHIN*0.. in Cypher) whereas recursive CTEs are required in SQL.
Datalog is another example of a query language that handles recursive queries very well:
// Data
parent(alice, bob).
parent(bob, carol).
// Rules: base case, recursion
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
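// Query: ancestor(alice, Who)? yields Who = bob (base case) and Who = carol (via the recursive rule)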
Datalog is used in graph databases, but also in IAM, data lineage, and static program analysis systems.
Chapter 3: Storage and Retrieval
2 families of storage engines/indexes considered in this chapter:
- Log-structured
- Page-oriented
Log-structured databases and indexes
log: append-only sequence of records
A fundamental example
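The book illustrates this with two tiny shell functions that append to and grep a text file; here is a minimal Python sketch of the same idea (the file name database.txt is arbitrary):

import os

DB_FILE = "database.txt"

def db_set(key, value):
    # O(1): appending one line is all a write costs.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # O(n): every read scans the whole file; the last value seen wins.
    if not os.path.exists(DB_FILE):
        return None
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result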
Setting aside the lack of concurrency control, disk space reclamation, and handling of errors and partial writes, the performance of this log as a database is very unbalanced:
- Writes: O(1)
- Reads: O(n)
Indexes
An index is an additional data structure that aims at increasing the performance of queries.
Trade-off: an index speeds up reads (if well designed) and slows down writes
Hash indexes
E.g. an in-memory HashMap<Key, Offset> mapping each key to the byte offset of its latest record in the log file
Example: Bitcask, the default storage engine of the distributed NoSQL key-value store Riak
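A minimal sketch of the idea, building on the toy log above (binary mode so that offsets are plain byte positions):

DB_FILE = "database.txt"
index = {}   # key -> byte offset of the latest record for that key

def db_set_indexed(key, value):
    with open(DB_FILE, "ab") as f:
        offset = f.tell()                 # where this record starts
        f.write(f"{key},{value}\n".encode())
    index[key] = offset

def db_get_indexed(key):
    if key not in index:
        return None
    with open(DB_FILE, "rb") as f:
        f.seek(index[key])                # jump straight to the record
        line = f.readline().decode()
    return line.rstrip("\n").partition(",")[2]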
Memory management
Back to the data-store (not the index): since the underlying file is append-only, how is its size kept under control?
i) Segmentation: break the log into chunks
ii) Compaction: only retain the most recent value for each key
iii) Merging: same as compaction but over several compacted segments
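A hedged sketch of steps ii) and iii), treating a segment as a list of (key, value) records ordered from oldest to newest:

def compact(segment):
    # Keep only the most recent value for each key within one segment.
    latest = {}
    for key, value in segment:          # later records overwrite earlier ones
        latest[key] = value
    return list(latest.items())

def merge(segments):
    # Merge several compacted segments; pass them from oldest to newest
    # so that the most recent value for each key wins.
    merged = {}
    for segment in segments:
        merged.update(dict(segment))
    return list(merged.items())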
Details on log files
A few practical implementation issues:
- File format efficiency: binary encodings are preferable
- Deletion of records: see Tombstone
- Crashes & partial writes: snapshots of each segment's hash map (for faster recovery), checksums on each record
- Concurrency: usually only one writer thread is implemented
Why not just use the hashmap itself to store values? Because the log paradigm (append-only/no overwrites) brings benefits:
- No risk of partial overwrites: much easier concurrency/crash recovery
- Sequential writes are faster than random writes
- Less fragmentation thanks to merging
Drawbacks:
- Range queries are inefficient because the keys are stored in random order
- Hash indexes must fit in memory
SSTables/LSM-Trees
A Sorted String Table or SSTable is a log with the entries sorted by key.
Advantages:
- Merging segments: gets simple and efficient, as a mergesort-like algorithm can be used
- Sparse index: no need to keep all the keys in memory (if you know where keys 1 and 10 are, then you know 5 is somewhere in between)
- Compression & I/O bandwidth: each entry of the sparse index points to a compressed block
How it works:
- Writes: insert key and value into a memtable := in-memory balanced tree (e.g. AVL, red-black)
- When size(memtable) > threshold: flush memtable to disk as an SSTable
- Reads: look into memtable, then most recent segment, then second-most-recent segment, and so on…
- Merge and compact + discard delete/overwritten records once in a while
Examples: LevelDB, an alternative to Bitcask in Riak; Google Bigtable (GCP), whose paper introduced the terms SSTable and memtable.
Implementation challenges:
- Lookups of non-existing keys are slow (they require checking the memtable and then every segment). Solution: Bloom filters
- Efficient compacting/merging strategies: size-tiered vs. leveled
SSTables are an implementation of Log-Structured Merge-Trees (LSM-Trees). They can sustain high throughput because the writes are sequential.
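A minimal sketch of that write path, with a plain dict standing in for the balanced tree and in-memory sorted lists standing in for on-disk SSTables (class and parameter names are made up):

class TinyLSM:
    # Toy LSM-tree: an in-memory memtable flushed to sorted segments.
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # stand-in for an AVL/red-black tree
        self.segments = []          # newest segment last; each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write the memtable out as an SSTable sorted by key.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # 1) check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):   # 2) most recent segment first
            for k, v in segment:                  # a real SSTable would binary-search
                if k == key:
                    return v
        return None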
B-Trees
B-trees also sort the key-value pairs by key. They are a generalization of binary search trees.
The database is split into fixed-size chunks called pages or blocks. Only one page is read or written at a time. Files are overwritten and modified in place, with no restriction to appending data as with SSTables.
Overwrites are risky and complicated. Example: writing to an already-full page requires that page to be split into two children and the parent page to be updated. What if the database crashes in the middle of the operation? Orphan pages may be created.
Structure:
- Root: the first page of the B-tree, where all lookups start.
- Root and subsequent pages contain:
- Keys
- References: a reference surrounded by keys X and Y leads to a page whose keys lie between X and Y
- Leaves: pages containing either the values themselves, or references to where the values are stored
Branching factor: the number of references to child pages in one page
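A hedged sketch of a lookup descending through such pages, with dicts standing in for pages (the keys/children/values field names are made up):

from bisect import bisect_right

def btree_lookup(page, key):
    # Interior pages hold sorted keys plus len(keys) + 1 child references;
    # descend into the child whose key range contains the key.
    while "children" in page:
        i = bisect_right(page["keys"], key)
        page = page["children"][i]
    # Leaf page: keys and values side by side.
    for k, v in zip(page["keys"], page["values"]):
        if k == key:
            return v
    return None

leaf1 = {"keys": [100, 200], "values": ["a", "b"]}
leaf2 = {"keys": [300, 400], "values": ["c", "d"]}
root = {"keys": [250], "children": [leaf1, leaf2]}
print(btree_lookup(root, 300))  # "c"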
Implementation challenges/optimizations:
- Crash recovery: Write-Ahead Log (WAL) a.k.a. redo log / copy-on-write schemes (also helps with concurrency)
- Store abbreviated keys, i.e. just enough information to act as boundaries between key ranges
- Additional data structure to link leaf pages in sequential order, or references to siblings
- Concurrency: see B-Tree locking
SSTables vs. B-Trees
Generally, SSTables appear to be:
- Faster for writes: writing is sequential, less write amplification, less fragmentation
- Slower for reads: multiple segments to check at different stages of compaction
See detailed comparison p. 84-85
Additional considerations on indexes
Primary/secondary indexes
Unlike primary (generally clustered) indexes, secondary indexes may index duplicate values. Log-structured indexes and B-Trees can be used as secondary indexes. Techniques to deal with duplicate indexed values:
- inverted index / postings list
- Append a row identifier to each entry
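A minimal sketch of the postings-list approach: the secondary index maps each indexed value to the list of row identifiers that carry it (row data and column names are illustrative):

from collections import defaultdict

rows = {
    1: {"name": "Alice", "city": "Paris"},
    2: {"name": "Bob", "city": "Lyon"},
    3: {"name": "Carol", "city": "Paris"},
}

# Secondary index on "city": value -> postings list of matching row ids.
city_index = defaultdict(list)
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)

print(city_index["Paris"])  # [1, 3]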
Clustered/Nonclustered indexes
Values can be either referenced by indexes and stored in an auxiliary heap, or be part of the index itself:
- Clustered indexes store the value (there can be only 1)
- Nonclustered indexes store references
- Covering indexes store the values of a subset of columns. A query is said to be covered by the index if it can be answered using the covering index alone
Concatenated indexes
The key is created by simple concatenation of several fields, e.g. last_name + first_name
Use-case example: spatial data, see R-trees
Fuzzy, full-text indexes
See Apache Lucene
In-memory databases
Disks have 2 significant advantages:
- Durability
- Low cost per GB
The second advantage has become less significant over time, and many datasets now fit entirely in RAM, hence the rise of in-memory databases.
Examples: Redis, VoltDB, MemSQL…
Some in-memory databases provide multiple levels of durability (none, weak…) by several means:
- Log (on disk)
- Snapshots (on disk)
- Replication
Counterintuitive: in-memory databases are not faster simply because they avoid reading from and writing to disk; traditional databases already rely on the OS caching disk blocks, which can yield similar speed. In-memory databases are actually faster because they avoid the overhead of encoding in-memory data structures into a form that can be written to disk.
OLTP vs. OLAP
I.e. Online Transaction Processing vs. Online Analytical Processing, also related to transaction processing vs. batch processing.
- OLTP: Low-latency reads/writes, small number of records per query, fetched by some key and using an index
- OLAP: large number of rows fetched per query, but only a few columns, and aggregated
Data warehousing
The various OLTP systems in an organization (e.g. logistics, sales, employees, suppliers…) need high availability and low latency, so they are not suited to complex analytics queries that fetch many rows. The general idea of the data warehouse is to centralize read-only copies of data from the various OLTP systems through ETL processes. Data warehouses are also optimized for such analytics queries and use different indexing strategies than the ones presented above.
Key ideas: facts, dimensions, star schema, snowflake schema
Columnar storage
Most OLTP systems store the values of a given row next to each other. Instead, OLAP systems tend to use column-oriented storage. This allows reads of only a subset of columns, as often needed with analytics queries.
Columnar storage can be coupled with several techniques for even more efficient data retrieval:
- Column compression
- Bitmap encoding (similar to One-Hot encoding in Data Science), which allows vectorized processing downstream
- Run-length encoding on top of bitmap encoding
- Sort order on one or several columns, e.g. date_key if date ranges are often targeted, or even several different sort orders (one per replica, as Vertica does)
Column compression and ordering are more easily implemented with LSM-Trees with their sorted in-memory structure and incremental merges with files on disk. On the contrary, B-Trees require in-place updates, making them inappropriate for these use-cases.
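A hedged sketch of the bitmap and run-length encodings mentioned above, applied to a small column of country codes (data made up):

def bitmap_encode(column):
    # One bitmap per distinct value: bit i is 1 if row i holds that value.
    return {value: [1 if v == value else 0 for v in column]
            for value in set(column)}

def run_length_encode(bits):
    # Encode a bitmap as (bit, run length) pairs.
    runs, current, count = [], bits[0], 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = b, 1
    runs.append((current, count))
    return runs

column = ["DE", "DE", "FR", "FR", "FR", "DE"]
bitmaps = bitmap_encode(column)
print(bitmaps["FR"])                     # [0, 0, 1, 1, 1, 0]
print(run_length_encode(bitmaps["FR"]))  # [(0, 2), (1, 3), (0, 1)]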
Materialized views
Unlike traditional views in relational databases, materialized views are actual copies of the query results, written to disk. Data cubes or OLAP cubes are examples of materialized views: a grid of aggregates grouped by different dimensions, e.g. date and product_sk. They improve performance on specific queries but lack flexibility: once aggregated, it is no longer possible to filter on anything that is not a dimension of the cube, e.g. the revenue on a given date restricted to one specific location, if location is not one of the cube's dimensions.
Chapter 4: Encoding and Evolution
Evolvability was introduced in chapter 1, and changes in applications generally bring about changes in data.
As seen in chapter 2: in relational databases, only one schema is valid at a given time. Schema-on-read (schemaless) databases allow different formats to coexist.
Encoding formats
Encoding or serialization is about writing in-memory objects to self-contained byte sequences, for storage or transmission.
Programming languages' built-in encoding libraries
Not recommended: typically language-specific, unsafe, without proper versioning support and inefficient.
JSON, XML, CSV…
Broadly supported, but inherently ambiguous and without native support for binary data.
Binary encodings
Much more compact than human-readable formats, but require a schema. Appropriate for more “internal” data transfers.
Protobuf
Protocol Buffers, or Protobuf, is the most popular and widespread binary encoding.
- Forward compatibility: old code ignores fields added by a newer schema
- Backward compatibility: new code can read old data as long as the newer fields are optional (or have default values)
Beware of data type evolution. It may work, but it can be risky. E.g. 32-bit -> 64-bit is backward-compatible but not necessarily forward-compatible, as 64-bit numbers might get truncated when read with the old 32-bit type.
Avro
Avro is even more lightweight than Protobuf because it externalizes the schema: nothing in the encoded data identifies fields or types. The writer's schema and the reader's schema may be different, as long as they are compatible according to Avro's resolution rules. There are different ways for the reader to know the writer's schema:
- Write the schema at the beginning of an Avro file
- Maintain schemas separately and identify them by version numbers, then each record can reference a schema version
- Negotiate the schema with the Avro RPC protocol
Dynamically-generated schemas
Avro is more convenient for dynamically generated schemas because fields are identified by name. It automatically handles schema reconciliation between the writer’s and reader’s schemas. In contrast, Protobuf and Thrift use numeric field tags, and the user is responsible for maintaining tag consistency to ensure compatibility.
Dataflow
- Data might be lost when an older version of an app rewrites records previously written by a newer version of the app: fields unknown to the old code can be silently dropped (see the sketch after this list)
- Discussion about REST, RPC
- Message brokers
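To illustrate the first point with plain JSON (the field names are hypothetical): if an old version of the code only keeps the fields it knows about and rewrites the whole record, a field added by a newer version is silently dropped:

import json

OLD_KNOWN_FIELDS = {"user_id", "name"}   # fields the old code knows about

record_from_new_code = '{"user_id": 42, "name": "Alice", "photo_url": "https://example.com/a.png"}'

# The old code decodes into its own model, keeping only known fields...
obj = {k: v for k, v in json.loads(record_from_new_code).items() if k in OLD_KNOWN_FIELDS}
obj["name"] = "Alice B."                 # the actual update

# ...then writes the record back: photo_url has been lost.
print(json.dumps(obj))                   # {"user_id": 42, "name": "Alice B."}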