Async/Await in Libra Core

A couple of months ago I promised I would write up an experience report on Libra’s use of Async/Await. It’s taken me a bit longer than I had hoped to find the time, but here we go!

The Decision to Use Async/Await

Back before we made the decision to switch to Async/Await and Futures 0.3, we were already using Futures 0.1 in the network stack and in many other layers of the system. As I’m sure most of you know, asynchronous programming with Futures 0.1 is mostly done with a set of combinators or by hand-writing a state machine yourself. Most people don’t want to hand-write state machines, so most of our asynchronous code was written using combinators. One of the bigger challenges with combinators is the inability to borrow as you normally would in non-futures Rust code. If you have a data structure that needs to be shared and used in the bodies of multiple combinators, you either have to pipe it through the input of each combinator and the output of each produced future, or you take the easy route and end up with lots of Arc<Mutex<T>> all over the place. A very over-simplified example would look something like this:

// Futures 0.1 style: the shared map has to be cloned into each
// combinator closure (hence the Arc<Mutex<T>>).
let shared = Arc::new(Mutex::new(HashMap::new()));
let shared_cloned = shared.clone();

let f = stream.for_each(move |item| {
    // Do something with `item` and produce another future
    
    ... shared.lock() ...
})
.and_then(move |item| {
    // Do something with `item` and produce another future
    
    ... shared_cloned.lock() ...
})
.map_err(|err| {
    // All tasks must have an `Error` type of `()`.
});

tokio::spawn(f);

No one liked reading or writing code like this, so back at the beginning of the year we took another look at the state of Async/Await support in Rust to see if we could start using it. As an experiment we rewrote part of our network stack using Async/Await and Futures 0.3 and quickly saw how much more enjoyable it was to write asynchronous code. It solved many of the ergonomic issues we had with Futures 0.1, let us write clearer code, and eliminated a ton of instances of Arc<Mutex<T>>, in addition to giving us a nice productivity boost when writing asynchronous code. For example, the above could more easily be written as:

let f = async move {
    // `stream` is a Futures 0.3 Stream (with `StreamExt` in scope for
    // `.next()`), declared `mut` outside this block.
    let mut map = HashMap::new();
    while let Some(item) = stream.next().await {
        // Do something with `item`
        ... map.contains_key(&item) ...
        // Call an async fn that borrows `map` and uses `item`
        some_async_fn(&mut map, item).await;
    }
};

tokio::spawn(f.boxed().unit_error().compat());

So after our short experiment we decided to get the rest of the team using Async/Await and to migrate away from Futures 0.1.

Education and Migration

To help migrate a bunch of engineers to Async/Await, we wrote up a doc (which is outdated by now) to educate them and to address some (at the time) shortcomings and potential workarounds when using Async/Await. The main sections of the doc are Terminology, Compatibility Layer, and Workarounds.

Surprisingly, one of the greatest sources of confusion during the migration was people not fully understanding the differences between the parts of the asynchronous ecosystem (Async/Await, Tokio, Futures 0.1, Futures 0.3, etc.), so it was very important to fully explain the terminology: what each part of the ecosystem was and what purpose it served.

The compatibility layer between Futures 0.1 and std::future::Future provided by the Futures 0.3 crate was key to making our existing futures code interoperable with Async/Await. It allowed us to migrate pieces of code one at a time while still using the Tokio runtime. Without this compatibility layer I don’t believe we would have had as smooth a transition, or even been able to convince some people to begin using Async/Await.
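To make the two directions concrete, here’s a rough sketch (the function names are hypothetical, and `futures01` refers to the Futures 0.1 crate renamed in Cargo.toml):

use futures::compat::Future01CompatExt;         // 0.1 future -> std::future::Future
use futures::future::{FutureExt, TryFutureExt}; // .boxed(), .unit_error(), .compat()

// Old to new: a Futures 0.1 future becomes awaitable via `.compat()`,
// yielding a `Result` of its old Item/Error pair.
async fn await_old(old: impl futures01::Future<Item = u64, Error = std::io::Error>) {
    match old.compat().await {
        Ok(value) => println!("got {}", value),
        Err(err) => eprintln!("failed: {}", err),
    }
}

// New to old: a std::future::Future is boxed (making it Unpin), given the
// `()` error type Tokio 0.1 expects, and converted with `.compat()` so it
// can be spawned on the Tokio 0.1 runtime.
fn spawn_new(f: impl std::future::Future<Output = ()> + Send + 'static) {
    tokio::spawn(f.boxed().unit_error().compat());
}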

Once engineers began using Async/Await, they began wanting to do things they were used to doing with normal Rust code, like:

  • Write async functions that support multiple lifetimes
  • Have async trait functions
  • Write recursive async functions

So we also provided detailed workarounds for some of these limitations (where workarounds existed) and tried to provide a few tips for working with Futures 0.3.
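For example, the usual workaround for both async trait methods and recursive async fns is to return a boxed future. A minimal sketch (hypothetical types, not from our doc), using the BoxFuture alias from Futures 0.3:

use futures::future::{BoxFuture, FutureExt};

// Async trait methods: declare the method as returning a boxed future
// instead of marking it `async`.
trait Handler {
    fn handle(&mut self, msg: u64) -> BoxFuture<'_, ()>;
}

struct Counter {
    total: u64,
}

impl Handler for Counter {
    fn handle(&mut self, msg: u64) -> BoxFuture<'_, ()> {
        async move {
            self.total += msg;
        }
        .boxed()
    }
}

// Recursive async fns: route the recursive call through a boxed future
// so the compiler-generated state machine isn't infinitely sized.
fn countdown(n: u64) -> BoxFuture<'static, ()> {
    async move {
        if n > 0 {
            countdown(n - 1).await;
        }
    }
    .boxed()
}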

A Few Pain Points

Since we switched to Async/Await, most engineers have found it much easier to get their jobs done, but every once in a while we were reminded that this feature was still under development, and we ran into a few issues around debuggability.

Generally I’ve found error messages from the compiler to be fairly helpful during development, but some of the error messages you get when dealing with async code aren’t too friendly. One such error looks something like the following:

error[E0277]: `T` cannot be sent between threads safely
   --> path/to/file.rs:188:22
    |
188 |                     .boxed()
    |                      ^^^^^ `T` cannot be sent between threads safely
    |
    = help: within `impl failure::core::future::future::Future`, the trait `std::marker::Send` is not implemented for `T`
    
   . . .
   
   [100 or so lines describing long, difficult to read types]

There’s usually enough information hidden in these long error messages to figure out what the error is (that T isn’t Send), but sometimes it can be difficult to understand where and why that particular type T is being captured and stored across await points in the state machine generated by the compiler.
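The usual culprit is something that isn’t Send being kept alive across an .await. A tiny hypothetical reproduction (all names made up):

use std::sync::{Arc, Mutex};

async fn some_async_fn() {}

// `MutexGuard` is not `Send`, and it's still alive across the `.await`,
// so the generated state machine stores it and the whole future stops
// being `Send`.
async fn not_send(shared: Arc<Mutex<u64>>) {
    let guard = shared.lock().unwrap();
    some_async_fn().await; // `guard` is held across this await point
    println!("{}", *guard);
}

fn requires_send<F: std::future::Future + Send>(_f: F) {}

fn demo(shared: Arc<Mutex<u64>>) {
    // error[E0277]: `std::sync::MutexGuard<'_, u64>` cannot be sent
    // between threads safely
    requires_send(not_send(shared));
}

Digging through the error message and your code to figure out what you need to change to get it to compile can be a nuisance at times, but a more challenging issue is debugging asynchronous code at runtime.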

The life of an asynchronous task being run on an executor looks something like this (a minimal sketch of the mechanics in code follows the list):

  1. Executor drives task T until task T needs to wait for an asynchronous event.
  2. Task T arranges for whoever is responsible for watching the event to notify the executor when task T can make forward progress.
  3. When notified, the executor will drive task T again until either the task is complete or the task needs to wait for another asynchronous event.
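As a toy illustration of steps 2 and 3 (essentially the hand-written future pattern from the async book, not Libra code):

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct SharedState {
    ready: bool,
    waker: Option<Waker>,
}

struct EventFuture {
    state: Arc<Mutex<SharedState>>,
}

impl Future for EventFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut state = self.state.lock().unwrap();
        if state.ready {
            Poll::Ready(())
        } else {
            // Step 2: stash the waker so whoever watches the event can
            // tell the executor this task can make progress again.
            state.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

// Called by whatever is responsible for the asynchronous event.
fn notify(state: &Arc<Mutex<SharedState>>) {
    let mut state = state.lock().unwrap();
    state.ready = true;
    if let Some(waker) = state.waker.take() {
        waker.wake(); // Step 3: the executor polls the task again.
    }
}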

The problem comes when you need to debug an asynchronous task that gets stuck between steps 2 and 3 above. Given that some of these asynchronous tasks can be quite large and have a number of different await points, it’s not trivial to identify exactly which await point the task is stuck at and what it’s waiting for.

If the issue can be reproduced, it’s usually debugged by adding more logging or additional counters and metrics to help identify why a task has stalled. What would be helpful is the ability to easily inspect the state of an asynchronous task and identify which await point it’s stuck at, to help understand why no forward progress is being made.

One particular instance of this occurred while stress testing our network code. Libra’s network stack is written using the actor model and is composed of a number of tasks which communicate with each other over channels. Under certain circumstances (higher load) we found that some of the network actors would stall, and we couldn’t figure out what these tasks were getting hung up on. After adding a bunch of metrics and logs we found that many of the channels used for communication between these tasks were filling up; because they were bounded channels, a full channel would exert back pressure on the sending task, stalling it until there was room for the message. When a number of these channels became full, the result was a deadlock between a number of these tasks. This was a pretty easy mistake to make but turned out to be difficult to debug, since we couldn’t easily see why a task was blocked and waiting.
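A hypothetical reduction of that stall (not our actual actor code): two tasks wired together with bounded channels, where each inbound message fans out into more than one outbound message:

use futures::channel::mpsc;
use futures::{SinkExt, StreamExt};

// Each actor reads from its inbox and, per message, sends two messages
// to its peer. Wire two of these together with small bounded channels
// (e.g. `mpsc::channel(1)` in each direction) and both channels will
// eventually fill.
async fn actor(mut inbox: mpsc::Receiver<u64>, mut peer: mpsc::Sender<u64>) {
    while let Some(n) = inbox.next().await {
        // Back pressure: these awaits park the task when the peer's
        // channel is full, and nothing drains our own inbox while we
        // are parked here.
        peer.send(n + 1).await.unwrap();
        peer.send(n + 2).await.unwrap();
    }
}

Once both channels are full, each task is parked in send, waiting for capacity that only its (equally parked) peer could free by receiving, and neither ever makes progress again.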

Conclusion

Overall, working with Async/Await has been a huge improvement over the old style of combinators, and we’ve been able to write and review asynchronous code much more easily. We hit a couple of pain points along the way, but we’ve already seen huge improvements in the experience since we started using Async/Await back at the beginning of the year. We also anticipate that the experience will only continue to improve once Async/Await is stabilized (later this year, in Rust 1.39), as more people will be able to use and help improve this ecosystem.


Many thanks for the explanation; I’m in the process of catching up on Futures 0.3.

Which implementation of the Actor model are you using?

Hi @najamelan,

From reading the source code and bmwill’s explanation, I found that it’s Tokio.

tokio does not have an actor model implementation, but yes, it’s open source, so I will have a look at the source! duh.

If you do check the source code, then maybe look at fn start_event_processing; the processing task is spawned and driven via the Tokio framework. I’m not sure what you mean by an actor model implementation. Maybe someone can help provide more info for you :wink:

Feel free to share what kind of concurrent computation model they used. I would be interested in that as well. Thanks 👍

From the looks of it, they have an infinite loop over a match on event types that calls async fns with the corresponding message.

No statically typed actor model.

I also encountered the deadlock problem: the main task is sending a message through a channel to another task, which is in turn waiting to send a message back to the main task.