Fixing DistKV network problems

The DistKV network is fully asynchronous and does not arbitrate between inconsistent data, so sooner or later you will run into trouble.

This document explains how to get back out, if necessary.

Missing data

See the Server protocol <server_protocol> for details on how DistKV works. As that document shows, you have a problem whenever a node increments its tick but the associated data gets lost, e.g. because the node or its Serf agent crashes.

Worse, a server will not start while its “missing” list is non-empty, because writing to stale data causes inconsistencies that are difficult to resolve. TODO: allow the server to be in maintainer-only mode when that happens.

First, run distkv client internal state -ndmrk. Your output will look somewhat like this:

deleted:  # Ticks known to be deleted
  test1:
  - 12
known:  # Ticks known to be superseded
  test1:
  - 1
  - - 3
    - 10
  test2:
  - 1
missing:  # Ticks we need to worry about
  test1:
  - 2
node: test1  # the server we just asked
nodes:  # all known nodes and their ticks
  test1: 12
  test2: 1
remote_missing: {}  # used in recovery
tock: 82  # DistKV's global event counter

This is not healthy: the missing entry is not empty (here, tick 2 of node test1). You can manually mark the offending data as stale:

one $ distkv client internal mark test1 2
known:
  test1:
  - - 1
    - 11
  test2:
  - 1
node: test1
tock: 92  # If this is not higher than before, clean your glasses ;-)
one $

This shows that the offending tick has been successfully added to the known list. Calling distkv client internal state -m verifies that the missing list is now empty.
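If you check several servers, the "is the missing list empty?" test can be scripted. The following is a minimal Python sketch, not part of DistKV: it assumes the state output has already been parsed into a dictionary (for instance with a YAML parser), and the helper name missing_ticks is hypothetical. The shape of the "after" dictionary is also an assumption, modelled on the empty remote_missing entry shown earlier.

```python
def missing_ticks(state):
    """Return (node, entry) pairs found in the 'missing' section of a state dump.

    `state` is the parsed output of the state command as a plain dict.
    Entries are returned unchanged: nothing is assumed about how
    list-valued entries (ranges) should be interpreted.
    """
    missing = state.get("missing") or {}
    pairs = []
    for node in sorted(missing):
        for entry in missing[node] or []:
            pairs.append((node, entry))
    return pairs


# Hand-copied from the example output in this document.
before = {"missing": {"test1": [2]}, "node": "test1", "tock": 82}
# Assumed shape of the output once the missing list has been cleared.
after = {"missing": {}, "node": "test1", "tock": 92}

print(missing_ticks(before))  # [('test1', 2)]
print(missing_ticks(after))   # []
```

An empty result means that this server has nothing left to worry about; any remaining pair names the node and tick to inspect next.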

Use the mark command’s --broadcast flag to send the mark message to all DistKV servers, not just the one you’re a client of.

Marking a tick this way still allows the bad record to resurface when a node that holds it reconnects, assuming that there is one. You can use the mark command’s --deleted flag to ensure that the record will be discarded instead.
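When the same fix has to be applied repeatedly, it can help to build the command line in one place. The sketch below is an illustration only: mark_command is a hypothetical helper, and placing the flags before the node name is an assumption, so check the command's built-in help before relying on it. The command name and the two flags are the ones described in this document.

```python
import shlex


def mark_command(node, tick, deleted=False, broadcast=False):
    """Build the argument list for a 'mark' call on one tick of one node.

    Assumption: options go before the positional arguments.
    """
    argv = ["distkv", "client", "internal", "mark"]
    if deleted:
        argv.append("--deleted")    # discard the record instead of letting it resurface
    if broadcast:
        argv.append("--broadcast")  # tell all servers, not just the connected one
    argv += [node, str(tick)]
    return argv


# Print the commands for review instead of running them.
print(shlex.join(mark_command("test1", 2)))
print(shlex.join(mark_command("test1", 2, deleted=True, broadcast=True)))
```

The helper returns a list rather than a string so that it can be passed directly to subprocess.run once you have reviewed the printed commands.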