SWIM Introduction

Goals

This project is intended to implement the protocol described in the SWIM white paper, with elements drawn from Serf as well. The goals are simple:

  • Each member in a cluster should eventually be aware of all other members in the cluster, and should be able to run code when a member’s status changes.

  • Each member may define a set of key/value strings as its metadata. All members are eventually aware of all other members’ metadata, as well as any changes.

Failure Detection

The local member chooses a remote member at random and sends it a ping. If the remote member does not respond with an ack within a timeout, at least one other remote member is chosen and sent a ping-req asking it to ping the unresponsive member. If the remote member still does not respond with an ack, it is declared suspect. This process repeats indefinitely across every member of the cluster. If a suspect or offline member ever responds with an ack, whether to a ping or a ping-req, it is immediately returned to online status.
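A single probe from the steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `probe` function, the `ping` and `ping_req` callbacks, and the `num_indirect` parameter are all hypothetical names, and each callback is assumed to return True only if an ack arrived before the timeout. The sketch also tries helpers one at a time for simplicity, whereas a real implementation would likely send the ping-req packets concurrently.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical callbacks: each returns True if an ack arrived before the timeout.
Ping = Callable[[str], bool]          # ping(target)
PingReq = Callable[[str, str], bool]  # ping_req(helper, target)


def probe(target: str, peers: Sequence[str], ping: Ping, ping_req: PingReq,
          num_indirect: int = 1, rng: Optional[random.Random] = None) -> str:
    """Return the status ("online" or "suspect") implied by one probe of target."""
    rng = rng or random.Random()
    # Step 1: direct ping.
    if ping(target):
        return "online"
    # Step 2: ask other members to ping the target on our behalf.
    helpers = [p for p in peers if p != target]
    for helper in rng.sample(helpers, min(num_indirect, len(helpers))):
        if ping_req(helper, target):
            return "online"
    # Step 3: no ack by either route, so the target becomes suspect.
    return "suspect"
```

Note that a failed probe only ever yields suspect here; the later move from suspect to offline depends on elapsed time rather than on a single probe.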

Dissemination

The local member chooses a remote member at random to send gossip to. The last known sequence clock of the remote member is used to determine which other members to gossip about. For example, suppose member A is deciding whether to gossip about member B to member C. If C last reported a sequence clock of 9 and B was last updated at sequence clock 13, then A sends the gossip. C would then report a new sequence clock of 13 (or higher!), and future gossip about B would not be sent. This process repeats indefinitely across every member of the cluster. At rest, when all members have received the most recent gossip about all other members, no gossip is sent at all until the next time a member changes status or metadata.
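The selection rule in the example can be expressed as a simple filter. This is a sketch under stated assumptions: the `Known` record and the `gossip_for` function are hypothetical names, each known member is assumed to carry the sequence clock value at which it was last updated, and gossip about the recipient itself is assumed to be skipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Known:
    """Hypothetical record of what the local member knows about one member."""
    name: str
    status: str
    updated_at: int  # sequence clock value when this entry last changed


def gossip_for(recipient: str, recipient_clock: int,
               known: List[Known]) -> List[Known]:
    """Select entries that are newer than the recipient's last reported clock."""
    return [m for m in known
            if m.name != recipient and m.updated_at > recipient_clock]


# The scenario from the text: C last reported 9, B was updated at 13.
known_by_a = [Known("B", "online", 13), Known("D", "online", 7)]
to_send = gossip_for("C", 9, known_by_a)       # only the entry for B
at_rest = gossip_for("C", 13, known_by_a)      # nothing left to send
```

Once C reports a clock of 13 or higher, the filter returns an empty list, which matches the at-rest behavior described above.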

Glossary

SWIM

Scalable Weakly-consistent Infection-style Process Group Membership Protocol, defined in the SWIM white paper.

member
node

A member is an instance that is initially aware of at least one other member and then transitively becomes aware of the members known by that member, forming a cluster.

local member

The member corresponding to the current process. This member is always online and is the only member whose metadata may be changed. The local member perceives other members in the cluster based on their most recently known status and metadata.

peer member

A remote member of the cluster, that is, any member other than the local member. A copy of each peer member’s status and metadata is maintained by the local member and updated based on failure detection and dissemination.

status

One of three states that a peer member can hold, as perceived by the local member: online, suspect, or offline.

online

A status meaning recent failure detection attempts have successfully received an ack from the peer member.

suspect

A status meaning that a recently-online member has not responded to at least one failure detection attempt.

offline

A status meaning the member has not responded with an ack to any recent failure detection attempts. A suspect member becomes offline only after some time elapses, to prevent false positives.
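The three statuses and the transitions between them can be summarized in a small state function. This is only a sketch: `Status`, `next_status`, and the `suspect_timeout` parameter are hypothetical names, and the length of the delay before a suspect member goes offline is an assumption, since the text only says that some time must elapse.

```python
import enum


class Status(enum.Enum):
    ONLINE = "online"
    SUSPECT = "suspect"
    OFFLINE = "offline"


def next_status(current: Status, acked: bool,
                suspect_for: float, suspect_timeout: float) -> Status:
    """Compute a peer's next status after one failure detection attempt.

    suspect_for is how long (in seconds) the peer has already been suspect.
    """
    if acked:
        # Any ack immediately restores online status.
        return Status.ONLINE
    if current is Status.ONLINE:
        # A missed attempt only makes the peer suspect, never offline directly.
        return Status.SUSPECT
    if current is Status.SUSPECT and suspect_for >= suspect_timeout:
        # The grace period has elapsed without an ack.
        return Status.OFFLINE
    return current
```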

metadata

An immutable mapping of key/value strings associated with each member. New mappings may be assigned, and the latest mapping will always be disseminated across the cluster.

transport

An interface for implementing alternative transmission mechanisms in place of the built-in UDP transport. A transport is capable of sending and receiving packet messages.

packet

A simple, one-way message that is sent from one member to another. With UDP, packets correspond directly to datagrams.

ping

A packet that requests that a peer member reply to the source member with an ack. This is the most basic attempt to detect when members have gone offline.

ping-req

A packet that requests that a peer member send its own ping to a second peer member. If the recipient receives an ack, it is forwarded back to the source member.

ack

A packet sent in response to a ping or ping-req, indicating that the member that sent the ack is online.

gossip

A packet that informs one member of the currently known status and metadata of another member.

sequence clock

A Lamport timestamp, an always-increasing counter where the next value is always higher than any other observed value, used to determine whether gossip is new enough to be applied or disseminated.
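A Lamport-style counter with this property can be sketched in a few lines. The class name `SequenceClock` and its `tick` and `observe` methods are hypothetical and chosen for illustration; they are not taken from the project.

```python
class SequenceClock:
    """Minimal Lamport-style counter (illustrative sketch only)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value

    @property
    def value(self) -> int:
        return self._value

    def observe(self, other: int) -> int:
        """Catch up to a value seen elsewhere; the clock never moves backwards."""
        self._value = max(self._value, other)
        return self._value

    def tick(self) -> int:
        """Advance for a local change; the result exceeds every observed value."""
        self._value += 1
        return self._value


clock = SequenceClock()
first = clock.tick()      # local change
clock.observe(9)          # a peer reported clock 9
second = clock.tick()     # next local change is stamped above 9
clock.observe(3)          # stale values are ignored
```

Because `observe` takes the maximum and `tick` always increments, any locally stamped change is guaranteed to compare as newer than everything the member has already seen.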

demo

The included demo is designed to show typed input being disseminated across the cluster, and to show member statuses changing as demo instances are stopped and started.