Destruction that works with multi-threads

In object-oriented programming, the life of an object starts with construction and ends with destruction. The idea is that before an object can be used, we must allocate memory for it, initialize its members, then initialize the object itself. Similarly, once an object is no longer needed, we must return its resources and release the memory. It sounds simple enough, but it becomes quite complicated in a multi-threaded environment.

Let's take a look at the lifetime of an object that is used by multiple threads.

Object Lifetime

When the object is first created, it is accessible by just one thread -- the creator thread (T1). T1 initializes the object and its members, then shares the object with other threads. From that point on, each thread executes at its own pace. Eventually the object reaches its end of life and should be destroyed. An object can be destroyed at most once, typically by one thread; no parallelism is needed here. Let's assume T2 ends up being the one responsible. T2 destructs the members of the object, and then releases the memory.

The symmetry between construction and destruction is natural and obvious: created once, destroyed once; created by one thread, destroyed by one thread.

The asymmetry is also there. The tasks of T1 and T2 are different. When T1 creates the object, it knows that only itself can access the object it just created. However when T2 is about to destroy the object, more than one thread has access to it. For T2 to do its job, it must somehow make sure other threads have stopped using the object. The mechanism can be

  1. A global shutdown signal that is sent to all threads (shutdown_condvar).
  2. A shared pointer that assigns the duty of destroying an object to the last bearer of the object (shared_ptr).
  3. A state in the object itself that marks it as invalid (weak_ptr), so that T2 can destroy the object at any time.
  4. A ref count included in the object itself, and a blocking mechanism to wait for that count to reach zero.

My favorite approach is #4 because it is the most universal, and it has deterministic runtime behavior compared with shared_ptr. We know upfront that T2 is the thread that destroys the object -- not a latency-sensitive thread, and not a thread that will still be busy with a time-consuming task ten minutes from now.

Notify and Destroy

In this approach, T2 usually needs to let the other threads know that they must give up the object. It can notify them either by flipping a boolean that the other threads are watching, or by sending a message to every holder of the object. Then T2 must wait for the ref count to reach zero (or one, if T2 itself is counted). After that it can safely destroy the object and release the memory.
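To make the sequence concrete, here is a minimal Go sketch of the idea. The type and field names are made up for illustration, and a sync.WaitGroup stands in for the ref count plus the blocking wait.

package lifecycle

import "sync"

// SharedThing is a hypothetical object used by many goroutines.
type SharedThing struct {
	mu       sync.Mutex
	stopping bool           // the "please give up the object" flag
	refs     sync.WaitGroup // the ref count other goroutines hold
	data     []byte
}

// Acquire registers one more user. It fails once shutdown has started.
func (s *SharedThing) Acquire() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopping {
		return false
	}
	s.refs.Add(1)
	return true
}

// Release drops one reference.
func (s *SharedThing) Release() { s.refs.Done() }

// Destroy is what "T2" runs: notify, wait for the count to reach zero,
// then tear the object down exactly once.
func (s *SharedThing) Destroy() {
	s.mu.Lock()
	s.stopping = true // notify: no new references will be handed out
	s.mu.Unlock()
	s.refs.Wait() // block until every holder has called Release
	s.data = nil  // now safe: no other goroutine can reach the data
}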

The waiting part is unavoidable. T2 must release the memory, and no more than one thread can release the memory. Even if all operations before releasing the memory are thread safe, we still need to have one thread do the last thing, and that thread has to wait.

There is an alternative. We could just not release the memory at all. The object is safe to use all the way to the end of the process. This is the easiest thing to do, but does put pressure on memory usage.

DOCTYPE changes CSS behavior

It seems CSS behaves differently with and without the DOCTYPE declaration. The HTML for my blog looks like

<!DOCTYPE html>
<html lang="">
<head>
        <meta charset="utf-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
...
</head>
...
</html>

And the page renders fine.

However, if I remove the <!DOCTYPE html> declaration from the top of the page, something odd happens: the social network icons at the bottom of my page float up a little bit.

I really don't know what is happening. My best guess is that without a DOCTYPE the browser falls back to quirks mode, which changes how some elements are laid out.

Object lifetime and threading

Last time we talked about object lifetime and ownership. Naturally scopes and objects form a tree hierarchy. The root of the tree is the scope where the program starts executing. Beyond the tree structure, we can pass information between scopes with the help of dynamic lifetime. Dynamic lifetime is hard to manage and is also the #1 source of bugs ("use after free") in C programs. The concept of ownership can simplify many useful cases of dynamic lifetime.

Threads

Threads make lifetime more complex. We now have several starting points from which threads execute independently. No assumption can be made about the progress of each thread. At a given point in thread A, there is usually no guarantee whether an object in thread B has been initialized or destroyed yet. That makes data sharing between threads extremely difficult. As it turns out, making sure the data stays alive while being shared is another hard problem to solve. If we do it wrong, our C program crashes and throws "core dumps" at us. There are many clever ways to guarantee liveness, but we are more interested in the foolproof ways that take advantage of single ownership.

Copy instead of share

Sharing implies more than one owner, and multiple owners are hard to coordinate. Instead of sharing, we make copies of the data for each interested party. Those copies have independent lifetimes, each owned by one thread. That simplifies the situation because each copy has a single owner.

In the case of sharing between two parties, the actual copying can be saved if the source (initiator) of sharing does not need the object anymore. Then sharing is reduced to a simple "transfer ownership" operation.

There are other ways of describing this strategy. We can think of it as sending a message. The source of the sharing makes a copy of the shared data and sends it to the target as a message. A message, by definition, goes out of the source's control after being sent. The target owns the message after receiving it. An RPC request from the source to the target achieves the same goal.
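A small Go sketch of the message idea, with made-up types: the source copies the data, pushes the copy into a channel, and never touches it again; the receiver becomes the sole owner.

package copying

// Report is a hypothetical piece of data we would otherwise share.
type Report struct {
	Rows []int
}

// clone makes a deep copy so the two sides never alias the same slice.
func (r Report) clone() Report {
	rows := make([]int, len(r.Rows))
	copy(rows, r.Rows)
	return Report{Rows: rows}
}

// send hands a copy to the target; the original stays with the source.
func send(ch chan<- Report, r Report) {
	ch <- r.clone() // the message now belongs to whoever receives it
}

// worker owns every Report it pulls out of the channel.
func worker(ch <-chan Report) {
	for r := range ch {
		r.Rows = append(r.Rows, 0) // safe: nobody else sees this copy
	}
}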

The famous paper Communicating Sequential Processes proposed a similar strategy. Shared data is considered an output of the source and an input of the target. An output cannot be modified after the source outputs it. An input is solely owned by the target thread. To some extent, the input/output metaphor is similar to the messaging metaphor.

By avoiding sharing, we avoid the difficulties of managing shared lifetimes. The drawbacks are more memory usage and more CPU time to copy data.

Reference counting

Reference counting is widely adopted as a native feature in many programming languages. For each piece of data, we keep a count of how many outstanding references there are. The data dies when there are no more references out there.

The advantage of reference counting is that it can be fully automatic. The programmer no longer needs to manage lifetime manually. The drawback is that it is usually unpredictable when the underlying object dies, or which thread will end up cleaning it up. This can be a problem in languages that use RAII extensively, like C++. Sometimes it is important not to run certain destructors on certain threads.

Speaking of C++, shared_ptr is the tool that implements the reference counting strategy. unique_ptr is often listed side-by-side with shared_ptr. They happen to correspond to the two strategies we talked about: unique_ptr is about message passing, while shared_ptr is about data sharing.

Carrier

The Carrier pattern improves upon reference counting and addresses the destruction problem. In this pattern, there is a Carrier<T> that owns an instance of some T. The carrier distributes references to the owned instance. References can be passed around and may be used in other threads, and they are guaranteed to be valid. When the shutdown procedure starts, the carrier stops producing references and waits for all the references it has handed out to be dropped. Gradually the other parties drop their references after receiving the shutdown signal. Once all references are dropped, the instance of T is no longer shared but solely owned by the carrier. We can then drop the instance or run cleanups that require an owned instance.
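A rough Go sketch of what such a carrier could look like. Carrier, Borrow and Shutdown are names I made up for illustration, not an existing library.

package carrier

import (
	"errors"
	"sync"
)

// Carrier owns one instance of T and hands out counted references to it.
type Carrier[T any] struct {
	mu       sync.Mutex
	stopping bool
	refs     sync.WaitGroup
	value    *T
}

func New[T any](v *T) *Carrier[T] { return &Carrier[T]{value: v} }

// Borrow returns the instance plus a release function. It refuses once
// shutdown has begun, so no new references escape after that point.
func (c *Carrier[T]) Borrow() (*T, func(), error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopping {
		return nil, nil, errors.New("carrier is shutting down")
	}
	c.refs.Add(1)
	return c.value, func() { c.refs.Done() }, nil
}

// Shutdown stops producing references, waits until every reference has been
// dropped, then runs the cleanup on the now solely-owned instance.
func (c *Carrier[T]) Shutdown(cleanup func(*T)) {
	c.mu.Lock()
	c.stopping = true
	c.mu.Unlock()
	c.refs.Wait()
	cleanup(c.value)
	c.value = nil
}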

Careful planning

There are really clever ways to manage object lifetime by planning very carefully. For example, for each function we can be very explicit about who is responsible for cleaning up the objects involved in the call. While it can be done, I would not recommend relying on that kind of cleverness routinely. Try to fit your use cases into one of the regular patterns above. If nothing works, maybe you should roll your own.

Thread Safety

Note none of these strategies help with thread safety of the object being shared. Thread safety is about

  • If one thread reads the shared data, could it see partially updated / invalid data? Could the data change while the thread is executing?
  • If one thread writes the shared data, could its writes be observed partially by other threads? Could its writes be partially overwritten by other threads?

Even the copying strategy is vulnerable to objects that are not thread safe. That is because objects can have references to other objects that are not deeply copied. The inner objects are still shared by all threads, although each of the outer objects has only one owner.

Generally speaking, an object that does not have mutable internal state is safe to share between threads. Any immutable reference to such objects is safe to send to another thread. If you are familiar with Rust, I believe those are the definitions of Sync and Send.

Conclusion

Managing lifetimes across threads is hard. There are clever ways to coordinate between threads. By preferring single ownership, we found three simple but powerful strategies for special use cases.

Object lifetime and ownership

Before learning Rust, I never thought about object lifetime and ownership that much. It turns out they have a lot to do with memory safety and thread safety. Nowadays I think about lifetime and ownership all the time, even when writing programs in C++. Here is a summary of my thoughts, inspired by Rust, but applicable to any programming language.

Each object has a lifetime. An object is alive when we can access it. In this article, an "object" is a generic term for a "thing" that represents a piece of memory or other resources. A string, an integer or a struct in C are all "objects".

Scope

The concept of a scope is well known. In C-style languages, a scope is usually a code block. When we start executing a code block between a pair of { and }, a scope is created. When we are done with the code block, the scope is destroyed. If an object is owned by a scope, its lifetime is bounded by the scope. The object goes out of life when the scope ends. Here is a minimal example.

{ // A scope is created.
  int x = 0; // Life of x starts here.
  ...
  x += 2;
} // Life of x ends here, when the scope ends.
// Cannot use x anymore.

Two blocks can own different sets of objects. They should not access or modify variables owned by each other. This is useful when we want isolation between code blocks. Functions, closures, loop bodies, branches all create scopes. Loop bodies are special, since they are essentially many scopes that look just like each other.

Object hierarchy

Object hierarchy is also a common concept. One object can own another object. When the outer object dies, the inner object dies with it. We say the lifetime of the inner object is bounded by the outer object.

struct ShortString {
  char content[100];
  size_t len;
};

In a string, the bytes in memory are usually owned by the string itself. When the string dies, we no longer need the bytes. Thus we usually choose to destroy those bytes when the string goes out of life.

Both scopes and objects allow nesting. Scopes and objects can own other scopes and objects. Together they form a tree structure of ownership.

Ownership tree. Execution order is from top to bottom. Squares are objects.

Ownership is useful

Let's have a look at a classic example of unsafe memory access.

ShortString *create_empty_short_string() {
  ShortString str;
  return &str;
}

ShortString *ptr = create_empty_short_string();  // <-- a scope is created and destroyed.
// str has gone out of life
// ptr points to something that does not exist.
ptr->len = 10;  // boom!

Using the concept of ownership, we can see that str is owned by the function scope. When that scope is gone, so is str. That is why we cannot safely access *ptr. ptr is known as a "dangling pointer". This is a confusing problem for C/C++ beginners, but it is easy to explain once we introduce the concept of "ownership". Ownership helps us understand memory safety.

Here is another observation. In the scope/object tree, a child scope can safely access any object that belongs to its parent. The child scope lives shorter, while the object lives slightly longer. That is why the code inside an if-else block can always use variables in the enclosing block.

Passing ownership

To exchange information between scopes, ownership of objects can be passed around. For example, when a parameter is passed to a function, so can be the ownership of the parameter. When a value is returned from a function, often the ownership is transferred from the function scope to the calling scope.

ShortString random_short_string(
  RandomParams params // Ownership of params is passed to the function.
) {  
  ShortString str;
  for (int i = 0; i < params.size; i++) {
    str.content[i] = 'a' + i;
  }
  str.len = params.size;
  return str;
}

ShortString short_string = random_short_string((RandomParams){.size = 10});
// Ownership of the string is passed to the calling scope.

Some C++ experts might start to scream and yell "but that value is copied!" and bring up copy elision, move semantics and so on. Please stop thinking about implementation details and focus on the intention. The intent of random_short_string() is clearly to hand over str to any caller. No copy has to be made, because random_short_string() does not want to keep the original for itself. Clear ownership helps avoid copying.

Dynamic lifetime

We often need more flexibility than a tree structure, and more than the interaction of two scopes. That flexibility can be achieved by a third type of lifetime: good until deleted.

ShortString *create_short_string() {
  return new ShortString();
}

ShortString *str = create_short_string();  // <-- a scope is created and destroyed.
// But the str lives beyond the scope.
scanf("%99s", str->content);
str->len = strlen(str->content);

...

// This function takes ownership of str.
void print_short_string(ShortString *str) {
  // str is now owned by this function.
  str->content[str->len] = '\0';
  printf("%s\n", str->content);
  delete str;  // str dies here.
}

print_short_string(str); // str passed to the function.
str->len;  // boom! No longer safe to access str.

Unlike "owned" objects, there is little guarantee around dynamic lifetime. A pointer can point to a valid object. It can also point to an object that has been deleted. The programmer must make sure the object is still alive when dereferencing a pointer. That, as it turned out, is a super hard thing to do.

Easier cases

Dynamic lifetime is hard. Over time, people discovered two special cases of dynamic lifetime that are easier to reason about: single ownership and shared ownership.

Single ownership: an object is passed around between scopes and objects, but at any point in time it can only be accessed by one owner. If the current owner decides the object is not needed anymore, the object goes out of life. It is the responsibility of the current owner to clean the object up.

Shared ownership: an object is shared between many scopes and objects. The object is alive as long as one of them still needs the object. The last owner is responsible for cleaning it up. Often it is not clear which scope that would be just by reading the code.

These two cases roughly correspond to std::unique_ptr and std::shared_ptr in C++. Unfortunately the complex syntax of C++ (e.g. std::move, &&, etc.) is not really the best tool for demonstrations, so we do not have a code example here. However, the conclusion is clear: ownership simplifies dynamic lifetime.

Static lifetime

Static is such an overloaded term. It is not the opposite of "dynamic" we talked about above. Here it means "good until the end of the program". An object is said to have a static lifetime, when it is alive throughout the whole program. Such objects are usually not destroyed by user code. Making everything static is a good way to solve the dangling pointer problem, except that it might use too much memory.

Conclusion

We talked about all three types of lifetime, and how "ownership" helps us with memory safety. Let's discuss how they help with thread safety in the next article.

Two ways to implement binary search

There are only two correct ways to implement binary search. Here is the less known one.

int binary_search(int *nums, int len, int target) {
  int l = -1;
  int r = len;
  while (l < r - 1) {
      int mid = (l + r) >> 1;
      if (nums[mid] < target) {
        l = mid;
      } else {
        r = mid;
      }
  }
  
  return r;
}

This version returns the index of the first element in nums that is greater than or equal to the target. If there is no such element, len will be returned.

Although only r is returned, l could also be of interest. It contains the index of the last element that is smaller than the target. If there is no such element, l will be -1, which is a good indicator for "no such element". It is always the case that r = l + 1 when the algorithm terminates.

Replace < with <=, and l will be the index of the last element that is smaller than or equal to the target, while r will be the index of the first element that is greater than the target, subject to the same special cases above if there is no such element. If operator <= is not implemented, use !(target < nums[mid]), which is logically equivalent to nums[mid] <= target.

Final value of pointers

Note if there is only one element in nums that is equal to target, (<,r) and (<=,l) will converge. If target is not in nums, see the following picture.

Final value of pointers – target not found

Or if you prefer precise definitions:

op    pointer    is the    that is                          or (if there is no such element)
<     l          last      smaller than (<)                 -1
<=    l          last      smaller than or equal to (<=)    -1
<     r          first     greater than or equal to (>=)    len
<=    r          first     greater than (>)                 len

The third line in the above table (<, r) implements lower_bound, while the fourth line (<=, r) implements upper_bound.
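For reference, here is the same "l starts at -1" style written out as lower_bound / upper_bound helpers -- a Go transcription of the table, with my own function names.

package search

// lowerBound is the (<, r) row: index of the first element >= target,
// or len(nums) if there is no such element.
func lowerBound(nums []int, target int) int {
	l, r := -1, len(nums)
	for l < r-1 {
		mid := (l + r) / 2
		if nums[mid] < target {
			l = mid
		} else {
			r = mid
		}
	}
	return r
}

// upperBound is the (<=, r) row: index of the first element > target,
// or len(nums) if there is no such element.
func upperBound(nums []int, target int) int {
	l, r := -1, len(nums)
	for l < r-1 {
		mid := (l + r) / 2
		if nums[mid] <= target {
			l = mid
		} else {
			r = mid
		}
	}
	return r
}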

The drawback of this approach is that l is initially set to the special value -1. This might not always be possible if the index is of type usize. The advantage is that neither l nor r needs a plus one or minus one in each iteration, which leaves little room for error.

The other way

The other correct way of implementing binary search is the approach used by lower_bound and upper_bound in the C++ STL. It can be found on cppreference.com.

The interesting aspect of this implementation is that the two pointers start as a valid range [0, len) but terminate as an empty range [r, r), with l and r equal. To me that makes the implementation harder to reason about. I acknowledge that you don't have to reason about it very often, though.

Criteria of being correct

To be correct, binary_search must return a reasonable value in all of the following cases, when looking for 2.

1 2 3
1 2
1 3
2 3
1
2
3
1 1 2 2 3 3
1 1 2 2
1 1 3 3
2 2 3 3
1 1
2 2
3 3

You might also want to duplicate each case into an odd-length one and an even-length one.

Implementing Raft with Go

Following my previous post Raft, from an engineering perspective, I gathered some thoughts on related topics.

Latency and RPCs

Latency matters in a distributed system

Latency is tied directly to availability. The larger the latency, the longer the period the system is unavailable. RPC latency compounds quickly when multiple rounds of RPCs are needed, e.g. during a log entry disagreement. In Raft, latency is mostly caused by the network. A lock (mutex, spinlock etc.) usually adds well under a millisecond of latency, whereas an RPC can be many times slower. Ironically, we can run hundreds of thousands of instructions in that time on most consumer CPUs today. There is a big price to pay to run distributed systems.

Channels and latency

Go channels can be a drag in a latency-sensitive environment. The sending goroutine usually goes to sleep right after putting a message into a channel. The receiving end, on the other hand, may not be woken up immediately.

I noticed this pattern when counting election votes. I have a goroutine watching the RPC response of RequestVote. It sends a "please count this vote" message to another goroutine after receiving the RPC response. What often happens is that the first goroutine prints "I received a vote", but the corresponding "I counted this vote" message never shows up before the next election starts. The scheduling behavior is a bit mysterious to me. In contrast, the cause and effect are much clearer and more direct when I use a condition variable.

RPCs must be retried

RPC failures can be common when the network is unstable. Retrying RPCs helps mitigate the risk of staying unavailable longer than necessary.

The downside of RPC retries is that if everybody retries at high frequency, a lot of network bandwidth is wasted on sending the same RPCs over and over again. The solution is exponential backoff. The first retry should be n ms after an RPC failure, the second retry should be 2n ms after that, then 4n and so on. The wait time grows quickly, so the number of retries grows only logarithmically with elapsed time and the retry traffic stays bounded.

Another downside of retries is the "stay down" situation. When an RPC server is down, the RPCs that ought to be processed during that time are held back by clients. When the server comes up again, a huge number of retries arrive at exactly the same time. That can easily overwhelm the RPC server's buffer and cause it to die again. The solution is to add a randomized component to the exponential backoff. That way retries arrive at slightly different times, leaving some processing headroom. Together the technique is called randomized exponential backoff, and it has proven to be effective.
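A minimal Go sketch of the retry loop; callRPC, the base delay and the cap are placeholders rather than anything from the 6.824 labs.

package retry

import (
	"math/rand"
	"time"
)

// callWithBackoff retries callRPC until it succeeds, waiting roughly
// n, 2n, 4n, ... between attempts, each wait jittered so that retries
// from different clients do not hit the server at the same instant.
func callWithBackoff(callRPC func() bool) {
	const base = 10 * time.Millisecond // the "n" in the text (made-up value)
	const maxWait = 2 * time.Second    // stop growing at some point
	wait := base
	for !callRPC() {
		jitter := time.Duration(rand.Int63n(int64(wait))) // [0, wait)
		time.Sleep(wait + jitter)
		if wait < maxWait {
			wait *= 2
		}
	}
}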

No unnecessary RPCs

The other extreme of handling RPCs is sending as many of them as possible. This is obviously bad, especially when the network is already congested. I once tried to make the leader send one heartbeat immediately after another, to minimize new elections when heartbeats are dropped by the network. It turns out even a software-simulated network has its bandwidth limit. The leader ended up not being able to do anything other than send heartbeats.

Another experiment I did was worse. In my implementation, the leader starts syncing logs whenever a new entry is created by clients. I accidentally made it backtrack (i.e. send another RPC with more logs) when an RPC timed out, and at the same time retry the original RPC as well. The number of RPCs blew up exponentially, because one failed RPC caused two more to be sent. The goroutine resources were quickly exhausted when a few new entries were added at the same time; that is, I reached the maximum number of goroutines that can be created. I have since reverted to retrying on an RPC failure, and backtracking only when the recipient disagrees.

Condition Variables

Mutexes and condition variables are handy

Mutexes are a well-known concept, but I had never heard of condition variables before, at least not in the form that comes with Go. A condition variable is like a semaphore, in the sense that you can wait and signal on it (the P/V primitives). The difference (among others) is that after being awoken, a waiter on a condition variable automatically re-acquires the lock associated with it. Acquiring the lock is both convenient and a requirement for correctness: the condition can only be safely evaluated while holding the lock, so that the data underneath the condition is protected.

When implementing Raft, there are many places where we need to wait on a condition. Once that condition is met, we need to acquire the global lock that guards core states. Condition variables fit perfectly into this type of usage.
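A small sketch of the pattern using Go's sync.Cond. The commit/apply condition is just an example I picked; the field names mirror the usual Raft lab style but are not the exact code.

package raftsketch

import "sync"

type raftState struct {
	mu          sync.Mutex
	cond        *sync.Cond // created with sync.NewCond(&mu)
	commitIndex int
	lastApplied int
}

func newRaftState() *raftState {
	rf := &raftState{}
	rf.cond = sync.NewCond(&rf.mu)
	return rf
}

// applyNext blocks until there is something new to apply. Wait releases the
// lock while sleeping and re-acquires it before returning, so both the check
// and the use of the core state below are protected.
func (rf *raftState) applyNext() {
	rf.mu.Lock()
	for rf.lastApplied >= rf.commitIndex { // always re-check after waking up
		rf.cond.Wait()
	}
	rf.lastApplied++ // still holding the lock: core state is safe to touch
	rf.mu.Unlock()
}

// advanceCommit is called by whoever moves commitIndex forward.
func (rf *raftState) advanceCommit(newIndex int) {
	rf.mu.Lock()
	if newIndex > rf.commitIndex {
		rf.commitIndex = newIndex
		rf.cond.Broadcast() // wake the waiters; they re-check the condition
	}
	rf.mu.Unlock()
}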

But sometimes I only need a semaphore ...

There is one case where I don't need the locking part of a condition variable: the election. After sending out requests to vote, the election goroutine will block until one of the following things happens:

  1. Enough votes for me are collected,
  2. Enough votes against me are collected,
  3. Someone has started a new election, or
  4. We are being shut down.

Those four things can easily be implemented with atomic integer counters. We do not need to acquire the global lock to access or modify those counters. What I really need is a semaphore that works with goroutines.
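A sketch of that "counters plus a wake-up" shape: atomic counters for the tallies and a one-slot channel that only nudges the election goroutine. The names are invented and only two of the four conditions are shown.

package election

import "sync/atomic"

type tally struct {
	votesFor     int32
	votesAgainst int32
	wake         chan struct{} // capacity 1, so signalling never blocks
}

func newTally() *tally { return &tally{wake: make(chan struct{}, 1)} }

// recordVote runs in the RPC-response goroutines. No global lock is needed.
func (t *tally) recordVote(granted bool) {
	if granted {
		atomic.AddInt32(&t.votesFor, 1)
	} else {
		atomic.AddInt32(&t.votesAgainst, 1)
	}
	select {
	case t.wake <- struct{}{}: // nudge the election goroutine
	default: // a wake-up is already pending; dropping this one is fine
	}
}

// waitForOutcome blocks the election goroutine until either side has a majority.
func (t *tally) waitForOutcome(majority int32) bool {
	for {
		if atomic.LoadInt32(&t.votesFor) >= majority {
			return true
		}
		if atomic.LoadInt32(&t.votesAgainst) >= majority {
			return false
		}
		<-t.wake // sleep until another vote is counted
	}
}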

It can be argued that after one of those things happens, we might need to acquire the global lock anyway, if we are elected. That is true. Depending on the network environment, being elected may or may not be the dominating outcome of an election. The line is a bit blurry.

I ended up creating a lock just for the election. The lock is also used to guarantee there is only one election running, which is a nice side effect.

Signal() is not always required?

It appears that in some systems other than Go, condition variables can unblock without anyone calling Signal(). The doc of Wait() says

Unlike in other systems, Wait cannot return unless awoken by Broadcast or Signal.

I'm wondering why that is. (Presumably those are the "spurious wakeups" that implementations such as pthreads allow, which is why the usual advice is to re-check the condition in a loop.) In those systems, extra caution should be taken before an awoken thread makes any moves.

Critical Sections

Critical sections are the code between acquiring a lock and releasing it. No lock should be held for too long, for obvious reasons. The code within critical sections thus must be short.

When implementing Raft, it is rather easy to keep it short. The most complex thing that requires holding the global lock is copying an entire array of log entries. Other than that, the only thing left is copying integers in core states. Occasionally goroutines are created while holding the lock, which is not ideal. I’m just too lazy to optimize it away.

Goroutine is not in the critical section

In the snippet below, given that the caller holds the lock, is the code in the goroutine within the critical section?

rf.mu.Lock()
go func() {
    term := rf.currentTerm // Are we in the critical section?
}()
rf.mu.Unlock()

The answer is no. The goroutine can run at any time in the future, maybe long after the lock is released on the last line. There is no guarantee that the lock is still held by the caller of the goroutine when it runs. Even if it is, the goroutine is still asynchronous to its caller.

As a principle, do not access protected fields in a goroutine -- or, if you must, have the goroutine acquire the lock itself while doing so.
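In other words, copy whatever the goroutine needs while the lock is held, and let the goroutine work on the copy. A sketch, reusing the snippet's Raft struct; broadcastHeartbeat is a hypothetical helper.

func (rf *Raft) kickOffRound() {
	rf.mu.Lock()
	term := rf.currentTerm // read the protected field while holding the lock
	rf.mu.Unlock()

	go func(term int) {
		// The goroutine only touches its own copy; it never reads rf's
		// protected fields and never needs rf.mu.
		broadcastHeartbeat(term) // hypothetical helper, for illustration only
	}(term)
}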

Miscellaneous

There is potentially a bug in the testing framework

The potential bug causes new instances not being added to the simulated network.

If I run the 'basic persistent' test in 2C 100 times, when a server is supposed to be restarted and connected to the network, there is a ~2% chance that none of the other servers can reach it. I know the other servers were alive, because they were sending and receiving heartbeats. The tests usually ended up timing out after 10 minutes. This error happens more frequently if I increase the rate of sending RPCs.

It could also be a deadlock in my code. I have yet to formally prove the bug exists.

Disk delay was not emulated

In 6.824, persistent() is implemented in memory. I can pretty much call it as often as I want, without any performance impact. But in practice, disk delay can also be significant if sync() is called extensively.

Corrections

When writing all of those down, I noticed that while my implementation passes all the tests, some of it might not be the standard way of doing things in Raft. For example, log entry syncing should really be triggered by timer only. I plan to correct those behaviors, and please read at your own risk. :-)

Raft, from an engineering perspective

I recently completed an implementation of the Raft consensus algorithm! It is part of the homework of the online version of MIT course 6.824. It took me 10 months on and off, mostly off.

The algorithm itself is simple and understandable, as promised by the paper. I'd like to summarize my implementation, and share my experience as an engineer implementing it. I wholeheartedly trust the researchers on its correctness. The programming language I used, as required by 6.824, is Go.

Raft

Raft stores a replicated log and allows users to add new log entries. Once a log entry is committed, it will stay in the committed state, and survive power outages, server reboots and network failures.

In practice, Raft keeps the log distributed between a set of servers. One of the servers is elected as the leader, and the rest are followers. The leader is responsible for serving external users, and keeping followers up-to-date on the logs. When the leader dies, a follower can turn into the leader and keep the system running.

Core States

In the implementation, a set of core states is maintained on each server. The states include the current leader, the log entries, the committed log entries, the last term and last vote, the time to start an election, and other bookkeeping information. On each server, the states are guarded by a global lock. Details are here.

The states on each server are synchronized via two RPCs, AppendEntries and RequestVote. We'll discuss those shortly. RPCs (remote procedure calls) are requests and responses sent and received over the network. It is different from function calls and inter-process communication, in the sense that the latency is higher and RPCs could fail arbitrarily because of I/O.

Looking back at my implementation, I divided Raft into 5 components.

Election and Voting

Responsible for electing a leader to run the system. Arguably the most important part of Raft.

An election is triggered by a timer. When a follower has not heard from a leader for some time, it starts an election. The follower sends one RequestVote RPC to each of the peers, asking for a vote. If it collects enough votes before someone else starts a new term, then it becomes the new leader. To avoid unnecessary leader changes, the timer will be reset every time a follower hears from the current leader.


Asynchronous operations can lead to many pitfalls. Firstly, if an election is triggered by a timer, we could have a second election triggered while the first is still running. In my implementation, I made an effort to end the prior election before starting a new one. This reduces the noise in the log and simplifies the states that must be considered. It is still possible to code it in a way in which each election dies naturally, though.

Secondly, latency matters in an unreliable network. A candidate should count votes ASAP when it receives responses from peers, and a newly-elected leader must notify its peers ASAP that it has collected enough votes. Using a channel in those scenarios can introduce significant delays, to the point that elections could not be reliably completed within the usual limit of 150ms ~ 250ms.

Thirdly, when the system is shut down, any running election should be ended as well. Hanging elections confuse peers, and more importantly, also confuse the testing framework of 6.824 that evaluates my implementation.

Heartbeats

To ensure that followers know the leader is still alive and functioning, the current leader sends heartbeats to followers. Heartbeats keep the system stable. Followers will not attempt to become a leader while they receive heartbeats. Heartbeats are triggered by the heartbeat timer, which should expire faster than any followers' election timer. Otherwise those followers will attempt to run an election before the leader sends out the heartbeat.

In my implementation, one "daemon" goroutine is created for each peer, with its own periodical timer. The advantage of this design is that peers are isolated from each other, so that one lagging peer won't interfere with other peers.

The leader also sends an immediate round of heartbeats after it has won an election. This round of RPCs is implemented as a special case. It does not even share code with the periodic version.

The Raft paper did not design a specific type of RPC for heartbeats. Instead, it uses an AppendEntries RPC with no entries to append. The original purpose of AppendEntries is to sync log entries.

Log Entry Syncing

The leader is responsible for keeping all followers on the same page, by sending out AppendEntries RPCs.

Unlike heartbeats, log entry syncing is (mainly) triggered by events. Whenever a new log entry is added by a client, the leader needs to replicate it to followers. When things run smoothly, a majority of the followers accept the new log entry. We can then call that entry "committed". However, because of server crashes and network failures, sometimes followers disagree with the leader. The leader needs to go back in the entry log, find the latest entry that they still agree on ("common ground"), and overwrite all entries after that.


Finding "common ground" is hard. In my implementation this is a recursive call to the same tryAppendEntries function. The function sends an AppendEntries RPC and collects the response. In case of a disagreement, it backtracks up the log entry list exponentially. First it goes back 1 entry, then X entries, then X^2 entries and so on. The recursion will not go too deep because of the aggressive "backtrack" behavior. This does mean a lot of the entries will be sent over the network repeatedly, which is less efficient.

The aggressive backtrack behavior is mainly designed for the limits set by the testing framework. In some extreme tests, an RPC can be delayed by as much as 25ms, or be dropped randomly, or never return. The network is heavily clogged. An election is bound to start about 150ms after a leader has won, when heartbeat RPCs fail and one of the election timers fires. That means the current leader only has ~6 RPCs (150ms / 25ms) to communicate with each peer, fewer if some RPCs are randomly lost in the network. The "backtrack" function really needs to go from 1000 to 0 in fewer than 6 calls. I imagine it would be tuned very differently if the 95th percentile RPC latency within the same cell were less than 5ms.

AppendEntries RPCs are so important that they must also be monitored by a timer. In some RPC libraries, an RPC can fail with a timeout error, and the timeout can be set by the caller. Unfortunately the labrpc.go that comes with 6.824 does not provide such a nice feature. I implemented the timer as part of the Heartbeat component, which checks the status of log sync before sending out heartbeats. If logs are not in sync, tryAppendEntries RPCs are triggered instead of heartbeats.

Like heartbeats, each peer should have its own 'daemon' goroutine that is in charge of log syncing. The heartbeat daemon could share the same goroutine with it. However, I did not find a way to wait for both a ticking timer and an event channel at the same time. Let me know if you know how to do that! Another thing is that my obsolete "all peers bundled together" system worked well enough, so I did not bother to upgrade.

Internal RPC Serving

We talked about how to send AppendEntries and RequestVote RPCs. But how are those RPCs answered?

The Raft protocol is designed in a way that the answer can be given just by looking at a snapshot of the core states of the receiving peer. There is no waiting required, except for grabbing the lock local to each peer. The only twist is that receiving those two RPC calls can result in a change of core states. If other components are designed to expect state change at any time, there is nothing to worry about.

External RPC Serving

Only the leader serves external clients. Each peer should forward "start a new log entry" requests to the current leader. This part is not required by 6.824 and not implemented.

In reality, clients should communicate with the system via RPCs. Just like internal RPC serving, the implementation should be straightforward.

The 6.824 testing framework also requires each peer to send a notification via a given Go channel, when a log entry is committed. I don't think this requirement applies to a real world scenario. This part is implemented as one daemon goroutine on each peer. It is made asynchronous because it communicates with external systems which might be arbitrarily slow. No RPC is involved.

Conclusion

Coding is fun. Writing asynchronous applications is fun. Raft is fun.

That concludes the summary. Stay tuned for my thoughts and comments!

文化不适 (Culture Unfit)

The title is a clumsy translation of "Culture Fit." Well, it should be "unfit."

I switched teams recently (nine months ago). The new team has two TLs (Tech Leads), one senior and one junior. Let's just call them the Big King and the Little King -- king as in royalty. Over these nine months, I have observed some "culture" that makes me uncomfortable.

The Little King works on a project with me. On Monday, he asked me to implement an improved version of an idea. After discussing (arguing) it with him in depth, I felt it had some merit. He also held a meeting with the sister team we collaborate with, briefed them on his idea, and got their approval. The Little King was quite excited about the idea.

On Friday we met with the Big King to discuss our project. Halfway through the meeting, the Big King briefly asked why we needed the improvement at all -- the original plan was perfectly fine. He then suggested we could follow the original plan first and do the improved version afterwards. The Little King changed our next steps back on the spot, without a single word of explanation. The ease of the acceptance took me by surprise.

This was not the first time I had run into this kind of scene. In private, I mentioned to the Big King that I did not understand this "no explanation" behavior. The Big King said he had not noticed anything like it himself, nor had he noticed anything similar in meetings with other leaders. His explanation was that it might just be a difference in work habits -- "picking the right battle to fight" -- or an attempt to win support as quickly as possible; perhaps the two proposals were not that different, or, more likely, there simply was no time to discuss. I said in that case I still have quite a lot to learn.

The last time something like this happened, a project the Little King was advising got shelved, and the colleague responsible for executing it switched to a different main project. That meeting was not even about the Little King's project, and the colleague doing the work was not in the room. In all my years at this company, I had never heard of such a thing.

I have also had the Little King reach into my personal planning notes and edit them, telling me what to do next week and the week after. I protested to my manager, and she said she would do the same. Hmm, come to think of it, she does often make very, very detailed requests too.

Both Kings are L5, and I am L5 as well. The Big King is also on the younger side.

Truly unfit.

新博客 New Blog

Planning to write something.

Write random stuff
