How LibraBFT should handle catastrophic global events

Libra’s mission is to enable a simple global currency and financial infrastructure that empowers
billions of people. (source)

If Libra is going to become the financial infrastructure for billions of people across the globe, I believe there needs to be much more emphasis on building a system that is resilient even in the worst case scenarios.

With the number of nodes capped on the order of 10^2, the possibility of a catastrophic global event permanently knocking out over 1/3 of the network is real. I don’t see any discussion in the whitepaper of what happens if a quorum cannot be reached.

This is an unacceptable outcome for the new financial infrastructure of the world. I respect the design of Ethereum 2.0, because it is built to be unstoppable – even in the harshest conditions. I believe LibraBFT can take some inspiration from this.

Based on Ethereum 2.0’s approach of decaying validator stake, we should be able to construct a naive solution to this problem using both LibraBFT and LMD GHOST.

Resilient LibraBFT

N_t is defined as the maximum number of consecutive timeouts / proposal failures. N_t = f + 1 should be an appropriate value, since there should never be more than f Byzantine validators. In the event of a catastrophe that destroys more than f validators, N_t consecutive timeouts should occur relatively quickly.
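As a minimal sketch of the trigger (the counter mechanics here are my assumption, not LibraBFT internals):

```rust
// Detecting N_t consecutive timeouts; any round that forms a QC resets the counter.
struct TimeoutTracker {
    consecutive_timeouts: u64,
    n_t: u64, // N_t = f + 1
}

impl TimeoutTracker {
    fn new(f: u64) -> Self {
        Self { consecutive_timeouts: 0, n_t: f + 1 }
    }

    /// Record a round outcome; returns true once N_t consecutive timeouts
    /// have been observed and the fallback mode should activate.
    fn record_round(&mut self, timed_out: bool) -> bool {
        if timed_out {
            self.consecutive_timeouts += 1;
        } else {
            self.consecutive_timeouts = 0; // a QC formed, reset
        }
        self.consecutive_timeouts >= self.n_t
    }
}

fn main() {
    let mut tracker = TimeoutTracker::new(1); // f = 1, so N_t = 2
    assert!(!tracker.record_round(true));
    assert!(tracker.record_round(true)); // second consecutive timeout triggers
}
```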

Once N_t timeouts are detected, the BFT algorithm will continue attempting to reach a 2f+1 consensus on new block proposals. Upon each additional timeout, the leader will broadcast the proposed block along with an aggregation of signatures from the validators who voted. Each node records the number of votes for the proposal and uses it to compute the heaviest chain of unfinalized blocks via the LMD GHOST fork-choice rule.
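A rough sketch of the fork-choice step, assuming subtree vote weights have already been aggregated from each validator's latest message (names and data layout are illustrative, not from the LibraBFT or Eth2 codebases):

```rust
use std::collections::HashMap;

type BlockId = u64;

// Block tree with per-subtree vote weights. In a fuller implementation the
// weights would be aggregated from each validator's latest vote.
struct Tree {
    children: HashMap<BlockId, Vec<BlockId>>,
    subtree_weight: HashMap<BlockId, u64>,
}

fn lmd_ghost_head(tree: &Tree, root: BlockId) -> BlockId {
    let mut head = root;
    // Walk down from the root, always descending into the heaviest child.
    while let Some(kids) = tree.children.get(&head).filter(|k| !k.is_empty()) {
        head = *kids
            .iter()
            .max_by_key(|b| tree.subtree_weight.get(*b).copied().unwrap_or(0))
            .unwrap();
    }
    head
}

fn main() {
    // Root 0 has children 1 (weight 5) and 2 (weight 9); 2 has child 3.
    let tree = Tree {
        children: HashMap::from([(0, vec![1, 2]), (2, vec![3])]),
        subtree_weight: HashMap::from([(1, 5), (2, 9), (3, 9)]),
    };
    assert_eq!(lmd_ghost_head(&tree, 0), 3); // heaviest branch wins
}
```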

Under this new consensus scheme, each validator that does not cast a vote for the current block proposal will be slashed some amount S. Where that slashed amount goes is out of scope for this post. The goal of the slashing is to decay the stake of offline validators so that, once their stake drops below a threshold T, they are removed from the consensus process. That allows the system to eventually return to an equilibrium where 2f + 1 validators can form a quorum and finalize new blocks. The first newly finalized block should point to the last known QC and include all of the valid transitions that occurred during the progression via LMD GHOST.
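A sketch of the decay loop under those assumptions (S, T, and the bookkeeping are placeholders from this post, not real protocol parameters):

```rust
// Slash S per missed vote; drop validators whose stake falls below T.
struct Validator {
    stake: u64,
}

fn slash_non_voters(validators: &mut Vec<Validator>, voted: &[bool], s: u64, t: u64) {
    for (v, &did_vote) in validators.iter_mut().zip(voted) {
        if !did_vote {
            v.stake = v.stake.saturating_sub(s); // decay the offline validator's stake
        }
    }
    // Drop validators below T so a 2f+1 quorum can re-form among the rest.
    validators.retain(|v| v.stake >= t);
}

fn main() {
    let mut set = vec![Validator { stake: 100 }, Validator { stake: 100 }];
    for _ in 0..10 {
        slash_non_voters(&mut set, &[true, false], 10, 20);
    }
    assert_eq!(set.len(), 1); // the offline validator decayed out of the set
}
```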

This new hybrid consensus mechanism will be able to continue progressing without finalization immediately following a catastrophic disaster. The construction does introduce many complex edge cases around forkability during the period in which unfinalized progress is made. I argue that despite the complexities, these edge cases should be resolved and Libra should be built with all outcomes in mind.

Any comments or feedback are welcome.


If Libra is going to become the financial infrastructure for billions of people across the globe, I believe there needs to be much more emphasis on building a system that is resilient even in the worst case scenarios.

💯

This is absolutely something that’s top of mind. We need to plan for these situations and drill to make sure that they work.

In fact, we’ve already included a provision in the proposal on association governance to solve what IMHO is a likely subcase of this — where the drain of validators happens slowly over time:

In order to prevent the number of inactive validator nodes in the network from growing to a level that could jeopardize the effectiveness of the consensus protocol, any member whose node has not participated in the consensus algorithm for 10 consecutive days may be automatically removed from the council by the Libra protocol. The member is free to rejoin once their node is operational.
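A rough sketch of how that provision might be mechanized (the data layout and clock handling here are my assumptions, not the actual Libra protocol):

```rust
use std::time::{Duration, SystemTime};

const INACTIVITY_LIMIT: Duration = Duration::from_secs(10 * 24 * 60 * 60); // 10 days

struct Member {
    last_participation: SystemTime,
}

fn prune_inactive(council: &mut Vec<Member>, now: SystemTime) {
    council.retain(|m| {
        now.duration_since(m.last_participation)
            .map(|idle| idle < INACTIVITY_LIMIT)
            .unwrap_or(true) // clock anomaly: err on the side of keeping the member
    });
}

fn main() {
    let now = SystemTime::now();
    let mut council = vec![
        Member { last_participation: now - Duration::from_secs(11 * 24 * 60 * 60) },
        Member { last_participation: now },
    ];
    prune_inactive(&mut council, now);
    assert_eq!(council.len(), 1); // the 11-days-idle member was removed
}
```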

Of course, this doesn’t solve the all-at-once version of this problem.


I’m going to let the team members who actually know about BFT jump in here, but from an outsider’s perspective the approaches you mention remind me of the BFT approaches that assume a client can communicate with honest nodes within some time Δ and use this to do BFT with 2f+1 replicas. Similar to those approaches, this approach requires the client to have a source of time and to assume that it can communicate with honest nodes within some bound (otherwise a malicious node could say “look, I’m the only node that’s been up for the past month” and simply lie about the liveness of the honest nodes). However, this approach extends the classical result by handling an arbitrary number of failures.

https://dahliamalkhi.wordpress.com/2019/04/24/flexible-byzantine-fault-tolerance/ is an interesting recent result from one of our team members. This result shows that different clients can make different tradeoffs between liveness and security. This feels like an important piece of the puzzle. For example, if Calibra is reading the blockchain for its wallet product, we’d probably prefer to avoid relying on liveness assumptions and instead trigger a pager alert to our team if we detect that more than f replicas have failed for some period of time. Given how rare a catastrophic failure is, it’s worth taking a downtime hit in order to confirm the correct handling of the situation. OTOH, an end user probably just wants the software to do the right thing.


One last thought here: given how rare this type of occurrence is, I wonder if it’s worth handling like a security update — something that you have to look to an external source of trust for. Tuning S seems like it could make it tricky to provide liveness (not clear if we’ve really handled a catastrophic event if it takes 2-3 days to get operational) but also provide sufficient safety against eclipse attacks. An approach that we’ve thought about is how we might be able to handle this kind of hard fork with minimal code changes. i.e. could we have a way of saying “reconfigure your client to sync to transaction 12345 then change the validator set to X, Y, Z”. This type of config update could be pushed relatively quickly.
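A hypothetical shape for that config update (all field and function names here are invented for illustration, not an actual Libra mechanism):

```rust
struct EmergencyReconfig {
    sync_to_transaction: u64,       // halt/sync point, e.g. "transaction 12345"
    new_validator_set: Vec<String>, // replacement validators, e.g. X, Y, Z
}

fn ready_to_switch(cfg: &EmergencyReconfig, current_txn: u64) -> bool {
    // Only swap validator sets once the node has synced to the agreed point.
    current_txn >= cfg.sync_to_transaction
}

fn main() {
    let cfg = EmergencyReconfig {
        sync_to_transaction: 12345,
        new_validator_set: vec!["X".into(), "Y".into(), "Z".into()],
    };
    assert!(!ready_to_switch(&cfg, 12000)); // keep syncing under the old set
    assert!(ready_to_switch(&cfg, 12345)); // now adopt the new validator set
    println!("new set: {:?}", cfg.new_validator_set);
}
```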


I like your idea. One more question: under your hybrid consensus mechanism, the security of the Libra network is damaged not only by the disaster itself but also by the consensus mechanism. If the set of validators shrinks during a catastrophic event because of slashing, we face a trade-off between safety and liveness. We have to accept one of two situations: either Libra stops working, or the network becomes vulnerable to malicious players because only about 2/3 of the validators remain active, which leaves only 66 validators!
I think your suggestion would work to some extent, but in extreme cases, such as only 10 validators remaining in the network, it is dangerous to let the system keep operating. As a result, in this roughly 100-validator situation, the slashing feature can be a potential danger. Ethereum is a public network with tons of miners, so I think it’s a different case.
I am not an expert on Ethereum or BFT, so I may be missing some details or common knowledge. I would be pleased to get feedback from the community and the author.
Cheers


This is an important topic to consider; thanks for bringing it up. I have a few ideas and would appreciate your thoughts.

  1. As long as the validators still have access to their keys (ideally, replicated globally), they could bring up their validators in areas not affected by the disaster and rejoin the network.

  2. As Ben mentions, it would be difficult to support slashing due to a non-action (e.g. not voting), as we assume a partially synchronous system in which we cannot differentiate between a slow (or attacked) network and down machines. Additionally, in LibraBFT, a leader collects the votes. A malicious leader could pretend not to receive votes in order to cause such slashing to occur. Generally, how does an honest validator prove to others that it voted for a proposal in time to avoid the slashing?

  3. +1 to a hard fork as an option in such a disaster, as Ben mentions.


This does bring up a good point: will Libra eventually support pluggable consensus like some Hyperledger projects, or go the path of Corda with member-defined / swappable Notary Services, @bmaurer?

Fabric started to support SBFT after its 1.0 release; however, being a private permissioned chain, it is a bit different from Libra in that regard (what are we calling it? public permissioned?). To be completely fault tolerant against threat actors and catastrophic network failures, there need to be 3f+1 validators in the network at all times, and at least 2f+1 have to reach consensus.
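For concreteness, the standard sizing arithmetic that implies (a small sketch, nothing Fabric- or Libra-specific):

```rust
// With n validators you can tolerate f Byzantine faults when n >= 3f + 1,
// and a quorum is 2f + 1 votes.
fn max_faults(n: u64) -> u64 {
    (n - 1) / 3 // largest f satisfying 3f + 1 <= n
}

fn quorum(n: u64) -> u64 {
    2 * max_faults(n) + 1
}

fn main() {
    assert_eq!(max_faults(100), 33); // a ~100-validator network tolerates 33 faults
    assert_eq!(quorum(100), 67);     // and needs 67 votes to reach consensus
}
```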

SBFT in that way is not only HA and fault tolerant, but it can also scale up to a large number of nodes, and it is great for low-latency finality, especially on a saturated network that can have orders of magnitude higher TPS than what the public testnet is facing right now.

(Of course, Fabric needed that because it uses Apache Kafka as an ordering service, where validators on the channel execute transactions in a sandbox environment (RW sets) and send them to Kafka streams before the “committer” validator nodes get them.) Still, by having almost two rounds of validation between the members of the network and the ordering service, you do pick up extra security against attacks on integrity and availability.

This also brings up another point besides pluggable consensus: how is off-chain storage going to be handled? To @aching 's point #1: validator node recovery (think of it as auto-scaling / desired-state orchestration) is a very viable option, especially if the validator nodes are containerized, which opens up a ton of deployment mechanisms. But how are they going to access the CA for their certs, and also access the world state at the time of failure? There is no such thing as real-time recovery, so that would probably still force a hard fork in the case of catastrophe.

And hard forks bring up a whole other can of worms, especially with a fiat-backed cryptocurrency like Libra. The DAO hard fork that produced Ethereum Classic wasn’t too bad, since Ether is really a token to perform transactions, and being PoW there were block rewards to incentivize folks to stick around after the hard fork. With Libra, a hard fork is going to come under INTENSE public scrutiny (let’s face it, even “crypto-influencers” largely don’t understand the under-the-covers of a blockchain), and then you would need to true-up the coins in circulation and do cost recovery (another reason for an off-chain world state to be persisted, in case coins need to be reminted and redistributed).


Sorry for the delayed response, I’ve had this half-written in my drafts for the last week but haven’t had a moment to finish it!

Maybe if I describe this construction in two scenarios, it will become clearer whether I am making an invalid assumption about LibraBFT or if this holds true.

Scenario 1

Let’s assume there are 3f + 1 validators, each with a single vote, and f of them are malicious. This is the worst-case scenario in which the 2/3 honest majority assumption still holds.

Since those f votes can neither create a QC for an invalid block nor stall progress, the malicious validators instead try to destabilize the network by forcing LMD GHOST consensus to begin.

In order to force LMD GHOST, there must be N_t consecutive timeouts. To achieve a timeout, a quorum must be formed. Since 2/3 of the validators are still honest, as soon as one of them becomes the leader, a new block will be proposed and a QC can be generated. So we need to assume that each of the next N_t rounds is led by a malicious node. This yields P = [f/(3f+1)]^(N_t), where P is the probability that f malicious validators can force a healthy system into LMD GHOST. A sufficiently large N_t ensures P is negligible.
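Plugging numbers into that formula shows how quickly P vanishes; for example, with f = 33 (roughly a 100-validator network) and N_t = f + 1 = 34:

```rust
// P = (f / (3f + 1))^(N_t): the chance that malicious leaders alone
// produce N_t consecutive timeouts in a healthy network.
fn p_force_ghost(f: u64, n_t: u32) -> f64 {
    (f as f64 / (3 * f + 1) as f64).powi(n_t as i32)
}

fn main() {
    let f = 33u64;
    let n_t = (f + 1) as u32;
    println!("P = {:.3e}", p_force_ghost(f, n_t)); // ≈ 4e-17, i.e. negligible
}
```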

Scenario 2

Suppose a catastrophe does occur which permanently destroys some number of nodes N_d such that N_d > f. Of the remaining N_r nodes, N_r/2 are malicious and N_r/2 + 1 are honest. Neither faction can reach a 2f + 1 quorum to advance the chain or to time out.

To avoid clock synchronization, each node will need to decide locally at which time T it will broadcast its timeout message. Since no finalization will be occurring, this should be fine with respect to the LMD GHOST protocol. Each node will begin validating and building on the LMD GHOST chain once it decides N_t * T has passed. Although the nodes are assumed to be honest, they won’t be able to exploit being among the first to time out by double spending, since the blocks they produce would not be finalized and would not be attested to by the honest nodes as they all begin to time out. After some delta D, it can be assumed that all honest nodes have timed out and are able to contribute to LMD GHOST consensus.
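A sketch of that node-local rule, assuming a fixed per-round timeout T (the trigger shape is my interpretation of this construction):

```rust
use std::time::{Duration, Instant};

struct FallbackTimer {
    last_finalized_at: Instant,
    deadline: Duration, // N_t * T
}

impl FallbackTimer {
    fn new(n_t: u32, t: Duration) -> Self {
        Self { last_finalized_at: Instant::now(), deadline: t * n_t }
    }

    fn on_finalized(&mut self) {
        self.last_finalized_at = Instant::now(); // progress resets the clock
    }

    /// True once this node should start validating and attesting on the
    /// LMD GHOST chain; nodes cross this point at staggered times.
    fn should_enter_ghost(&self) -> bool {
        self.last_finalized_at.elapsed() >= self.deadline
    }
}

fn main() {
    let timer = FallbackTimer::new(3, Duration::from_millis(1));
    std::thread::sleep(Duration::from_millis(5));
    assert!(timer.should_enter_ghost()); // N_t * T elapsed with no finalization
}
```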

Since we made the assumption of N_r/2 malicious nodes and N_r/2 + 1 honest nodes, we can be sure that the “honest” chain will be the heaviest in the long run.

Summary

  • The threshold for entering LMD GHOST consensus should be set such that the probability of a malicious faction of up to f nodes triggering it is statistically insignificant.
  • Nodes join LMD GHOST consensus after they observe N_t * T time passing, and they begin validating and attesting to the new chain. Since node joins will be staggered, there are no synchrony assumptions.
  • As long as at most N_r/2 nodes are malicious and at least N_r/2 + 1 nodes are honest, the honest chain will be the heaviest chain in the long run.
  • It is important to note that classic probabilistic thresholds should be met before relying on an LMD GHOST block (e.g. a number of block confirmations). Applications and services should take this into account before acting on a block that hasn’t been officially finalized with a QC; a sketch of such a check follows.
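A minimal sketch of such a confirmation-depth check (the threshold value is illustrative):

```rust
// Treat an LMD GHOST block as usable only after a configurable confirmation
// depth, since it carries no QC.
fn safe_to_rely_on(block_height: u64, head_height: u64, min_confirmations: u64) -> bool {
    head_height.saturating_sub(block_height) >= min_confirmations
}

fn main() {
    // With a 30-block confirmation policy, height 100 is usable once the
    // LMD GHOST head reaches height 130.
    assert!(!safe_to_rely_on(100, 120, 30));
    assert!(safe_to_rely_on(100, 130, 30));
}
```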

Additional Responses

i) This would only come into play after LMD GHOST consensus kicks in, which should really only happen in the event of something extraordinary, and ii) we can follow the Ethereum 2.0 scheme, where validators have a window to attest to blocks across multiple producers. If they can’t attest within that window for some reason, they wouldn’t meet the honesty assumption.

The Libra Association is overwhelmingly U.S.-based and could be wiped out via a coordinated attack. We would hope that these organizations would keep failsafe copies of their keys around the world, but that is not guaranteed. Additionally, the response time of gaining access to those keys and getting a new validator up and running would be non-trivial.

This could be a possibility, and it may be a valuable plan B while LMD GHOST consensus is a plan C or plan D. There are some difficult questions that need to be answered here: i) what is the protocol for pushing out an emergency update? ii) how long will it take to fork? iii) how are the funds of ejected validators handled? iv) what if the people who are qualified to perform this update are AFK (for good)? I still believe that all modern financial infrastructure should have a failsafe that allows it to proceed autonomously and deterministically in the event that humans are unable to do the right thing.

Since the system is permissioned (the Libra Association must induct new validators), I believe that even 10 validators would be acceptable as long as 51% of them are honest.