3.0 KiB

Raw Permalink Blame History

sled safety model

This document applies STPA-style hazard analysis to the sled embedded database for the purpose of guiding design and testing efforts to prevent unacceptable losses.

Outline

purpose of analysis
model of control structure
identify unsafe control actions
[identify loss scenarios][#identify-loss-scenarios)
resources for learning more about STAMP, STPA, and CAST

Purpose of Analysis

Losses

We wish to prevent the following undesirable situations:

data loss
inconsistent (non-linearizable) data access
process crash
resource exhaustion

System Boundary

We draw the line between system and environment where we can reasonably invest our efforts to prevent losses.

Inside the boundary:

codebase
- put safe control actions into place that prevent losses
documentation
- show users how to use sled safely
- recommend hardware, kernels, user code

Outside the boundary:

Direct changes to hardware, kernels, user code

Hazards

These hazards can result in the above losses:

data may be lost if
- bugs in the logging system
  - Db::flush fails to make previous writes durable
- bugs in the GC system
  - the old location is overwritten before the defragmented location becomes durable
- bugs in the recovery system
- hardare failures
consistency violations may be caused by
- transaction concurrency control failure to enforce linearizability (strict serializability)
- non-linearizable lock-free single-key operations
panic
- of user threads
- IO threads
- flusher & GC thread
- indexing
- unwraps/expects
- failed TryInto/TryFrom + unwrap
persistent storage exceeding (2 + N concurrent writers) * logical data size
in-memory cache exceeding the configured cache size
- caused by incorrect calculation of cache
use-after-free
data race
memory leak
integer overflow
buffer overrun
uninitialized memory access

Constraints

Models of Control Structures

for each control action we have, consider:

what hazards happen when we fail to apply it / it does not exist?
what hazards happen when we do apply it
what hazards happen when we apply it too early or too late?
what hazards happen if we apply it for too long or not long enough?

durability model

recovery
- LogIter::max_lsn
  - return None if last_lsn_in_batch >= self.max_lsn
- batch requirement set to last reservation base + inline len - 1
  - reserve bumps
    - bump_atomic_lsn(&self.iobufs.max_reserved_lsn, reservation_lsn + inline_buf_len as Lsn - 1);

lock-free linearizability model

transactional linearizability (strict serializability) model

panic model

memory usage model

storage usage model

3.0 KiB Raw Permalink Blame History