We want Transactions to always eventually complete, whether or not the underlying network conditions allow it at the moment. We also want to be able to upgrade the whole database without stopping all its shards first, to keep downtime as low as possible. Both of these problems can be solved by a custom transport layer implemented on top of the notify()
API and Coroutines.
<aside> <img src="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9e710720-07e3-41a1-995c-e06d7be5ba85/internet-computer-icp-logo.png" alt="https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9e710720-07e3-41a1-995c-e06d7be5ba85/internet-computer-icp-logo.png" width="40px" /> This algorithm could be implemented much more efficiently and elegantly if the following changes were made to the System API:

- A `request_id`, produced by the system when making an `ic0::call_new` inter-canister call.
- A `#[ic_cdk::callback]` lifecycle function that is executed as a backup when a `-1` callback ptr is provided to the `ic0::call_new` API.
- A way to get the `request_id` of the current message or reply (e.g. `ic0::msg_request_id()`).
- The ability to `ic0::msg_reply` to a specific `request_id`, even from the context of another request or no request at all (for example, in `heartbeat`).

With these four changes it would be possible to completely swap futures-based canister communications for coroutine-based communications. This might be a good alternative, since coroutines can live in stable memory and withstand canister upgrades while in a suspended state, without waiting for all interactions to complete.

Also, this would make it possible for coroutine-based canisters to freely interact with any other canister, including, for example, the management canister.
</aside>
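To make the proposal above concrete, here is a rough Rust sketch of what the proposed additions could look like. Everything here is hypothetical: the `RequestId` type, the `msg_request_id`/`reply_to` signatures, and the `MockSystem` stand-in are illustrations of the proposal, not part of any existing System API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical request id the system would hand out for every
/// `ic0::call_new` inter-canister call.
type RequestId = u64;

/// Hypothetical surface of the proposed System API additions.
trait SystemApiExt {
    /// The proposed `ic0::msg_request_id()`: the id of the message
    /// currently being processed, if any.
    fn msg_request_id(&self) -> Option<RequestId>;
    /// The proposed targeted reply: answer a specific `request_id`,
    /// even from an unrelated context such as `heartbeat`.
    fn reply_to(&mut self, id: RequestId, payload: Vec<u8>) -> Result<(), String>;
}

/// In-memory stand-in for the system, for illustration only.
struct MockSystem {
    current: Option<RequestId>,
    awaiting_reply: HashSet<RequestId>,
    replies: HashMap<RequestId, Vec<u8>>,
}

impl MockSystem {
    fn new() -> Self {
        Self { current: None, awaiting_reply: HashSet::new(), replies: HashMap::new() }
    }
    /// Simulate an incoming call that the system tagged with `id`.
    fn begin_message(&mut self, id: RequestId) {
        self.awaiting_reply.insert(id);
        self.current = Some(id);
    }
    /// Simulate leaving the message context (e.g. the coroutine suspended).
    fn end_message(&mut self) {
        self.current = None;
    }
}

impl SystemApiExt for MockSystem {
    fn msg_request_id(&self) -> Option<RequestId> {
        self.current
    }
    fn reply_to(&mut self, id: RequestId, payload: Vec<u8>) -> Result<(), String> {
        // A request can be answered exactly once, from any context.
        if !self.awaiting_reply.remove(&id) {
            return Err(format!("request {id} was already answered or never existed"));
        }
        self.replies.insert(id, payload);
        Ok(())
    }
}
```

The key property this models is that a reply can be issued *after* the original message context is gone, which is exactly what a suspended coroutine resuming from stable memory would need.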
If a canister exclusively uses notify calls for all inter-canister calls, it is safe to upgrade such a canister without stopping it.
We fix this by replacing `call()` invocations with `notify()` invocations, but this is not so easy. Once we do that, we are no longer able to check whether the data was received on the other side, nor to receive a response. So we have to implement this logic ourselves.
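The sender-side bookkeeping this implies can be sketched as follows. The `PendingCalls` name, the `u128` call id, and the serialized-args storage are all illustrative choices, not an IC API: every outgoing `notify()` gets a locally generated `call_id` that travels with the notification, and the args are kept around so the call can be re-sent until a response carrying the same `call_id` comes back.

```rust
use std::collections::HashMap;

/// Sender-side registry of notifications that have not been answered yet.
struct PendingCalls {
    next_id: u128,
    // call_id -> serialized args, kept for re-sends on timer.
    in_flight: HashMap<u128, Vec<u8>>,
}

impl PendingCalls {
    fn new() -> Self {
        Self { next_id: 0, in_flight: HashMap::new() }
    }

    /// Register an outgoing notification and get its fresh `call_id`.
    fn register(&mut self, args: Vec<u8>) -> u128 {
        let id = self.next_id;
        self.next_id += 1;
        self.in_flight.insert(id, args);
        id
    }

    /// A response for `call_id` arrived: drop the bookkeeping entry.
    /// Returns `false` for an unknown id (e.g. a duplicate response).
    fn complete(&mut self, call_id: u128) -> bool {
        self.in_flight.remove(&call_id).is_some()
    }

    /// Args to re-send when the timer rings and no response was received.
    fn args_for_retry(&self, call_id: u128) -> Option<&Vec<u8>> {
        self.in_flight.get(&call_id)
    }
}
```

In a real shard this map would live in stable memory, so pending calls survive upgrades together with their suspended coroutines.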
Since we’re developing a flow that only applies to inter-canister communications within a single software unit (our database), we’ll assume that shards always eventually respond to incoming notifications. There are only two reasons for a remote shard to be permanently unable to respond: it is out of cycles, or the endpoint doesn’t exist or panics. Both of these are the responsibility of the database admin/developer, and we can’t completely prevent them.
In our scenario there are no malicious actors. If the remote shard’s subnet is under heavy load and unable to respond at the moment, we should keep trying to communicate with it, because at some point it will come back online. These subtle but important assumptions allow us to implement the new communication flow in the most conflict-free way possible:
```mermaid
graph LR
subgraph Receive Ack
direction LR
Start4(("`Start`"))
Stop4(("`Stop`"))
Start4 --> IsPersisted2
IsPersisted2{"`
A response
with __call_id__
exists?
`"}
IsPersisted2 -- Yes --> DePersist["`De-persist the response`"]
IsPersisted2 -- No --> Stop4
DePersist --> Stop4
end
subgraph Receive Notification
direction LR
Start3(("`Start`"))
Stop3(("`Stop`"))
Start3 --> IsPersisted
IsPersisted{"`
A response
with __call_id__
exists?
`"}
IsPersisted -- No --> Exec["`Execute the __update__ method that was requested`"]
IsPersisted -- Yes --> Respond["`Respond to the notification with the response`"]
Respond --> Stop3
Exec --> Persist["`Persist the result of the __update__ method for 24 hours`"]
Persist --> Respond
end
subgraph On Timer_0
direction LR
Start2(("`Start`"))
Stop2(("`Stop`"))
TimerRings{"`
__timer_0__
rings?
`"}
Start2 --> TimerRings
TimerRings -- No --> TimerRings
WasReplied{"`
received __response__
for __call_id__?
`"}
TimerRings -- Yes --> WasReplied
Notify1["`__notify()__ the remote canister with __args__ and __call_id__`"]
SetTimer1["`Set __timer_0__ to N1 = N0 * 2 seconds`"]
MaxReached{"`
__timer_0__
reached 24
hours?
`"}
MaxReached -- Yes --> GiveUp["`Give up - log fatal error, clean up all dependent coroutines`"]
MaxReached -- No --> Stop2
WasReplied -- No --> Notify1
Notify1 --> SetTimer1
SetTimer1 --> MaxReached
WasReplied -- Yes --> Stop2
end
subgraph Notify Function
direction LR
Start1(("`Start`"))
Stop1(("`Stop`"))
Generate["`Generate __call_id__`"]
Notify["`__notify()__ the remote canister with __args__ and __call_id__`"]
Notify -.- Note
Note(["`Remote shard invokes __receive_notification__`"])
SetTimer["`Set __timer_0__ to N seconds`"]
Suspend["`Suspend current coroutine until the __response__ is received`"]
OverAndOut["`Send __ack__ for __call_id__`"]
OverAndOut -.- Note2
Note2(["`Remote shard invokes __receive_ack__`"])
Received(["`The response is received`"])
Resume(["`Resume current coroutine`"])
Start1 --> Generate --> Notify --> SetTimer --> Suspend --> Received --> OverAndOut --> Resume --> Stop1
end
```
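The retry schedule from the `On Timer_0` subgraph (start at N seconds, double the interval after each unanswered notification, give up once the interval reaches 24 hours) can be computed up front. A minimal sketch; the function name and the choice to precompute the whole schedule are illustrative, and the initial interval is a tuning knob:

```rust
/// Doubling retry intervals starting from `n0_secs`. The sender stops
/// retrying once the interval itself would reach 24 hours; at that point
/// it gives up, logs a fatal error, and cleans up dependent coroutines.
fn backoff_schedule(n0_secs: u64) -> Vec<u64> {
    const MAX_SECS: u64 = 24 * 60 * 60; // 24 hours
    let mut out = Vec::new();
    let mut n = n0_secs;
    while n < MAX_SECS {
        out.push(n);
        n *= 2;
    }
    out
}
```

With `n0_secs = 10` (as in the example below), the schedule runs 10, 20, 40, … seconds, keeping retry traffic cheap while the remote shard is down for a long stretch.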
An example of how this algorithm would work:
```mermaid
sequenceDiagram
participant S1 as Shard 1
participant S2 as Shard 2
activate S1
S1 ->> S2: notify() call with [call_id]
S1 ->> S1: set timer_0 to 10 sec
deactivate S1
S2 ->> S2: unavailable or panics
S2 ->> S2: fixed
activate S1
S1 ->> S1: timer_0 rings
S1 ->> S2: notify() call with [call_id]
S1 ->> S1: set timer_0 to 20 sec
deactivate S1
activate S2
S2 ->> S2: process the request with [call_id]
S2 ->> S2: persist response with [call_id] for 24 hours
S2 ->> S1: respond() with [call_id]
deactivate S2
S1 ->> S1: unavailable or panics
S1 ->> S1: fixed
activate S1
S1 ->> S1: timer_0 rings
S1 ->> S2: notify() call with [call_id]
S1 ->> S1: set timer_0 to 40 sec
deactivate S1
activate S2
S2 ->> S2: restore the response with [call_id]
S2 ->> S1: respond() with [call_id]
deactivate S2
activate S1
S1 ->> S2: (optimization) send ACK with [call_id] once
activate S2
S2 ->> S2: (if ACK received) de-persist the response with [call_id]
deactivate S2
S1 ->> S1: unset timer_0
S1 ->> S1: process the response with [call_id]
deactivate S1
```
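Note that the receiver must be idempotent: Shard 2 executes the update at most once per `call_id`, answers repeats from the persisted response, and de-persists on ACK or after 24 hours. A minimal in-memory sketch of that behavior (`ResponseStore` and its method names are illustrative; a real shard would keep this map in stable memory):

```rust
use std::collections::HashMap;

/// How long an unacknowledged response is retained, per the flow above.
const RETENTION_SECS: u64 = 24 * 60 * 60;

/// Receiver-side cache of responses, keyed by `call_id`.
struct ResponseStore {
    // call_id -> (response bytes, timestamp when stored)
    responses: HashMap<u128, (Vec<u8>, u64)>,
}

impl ResponseStore {
    fn new() -> Self {
        Self { responses: HashMap::new() }
    }

    /// Handle a (possibly repeated) notification. Runs `update` only the
    /// first time a `call_id` is seen; repeats get the cached response.
    fn receive_notification<F>(&mut self, call_id: u128, now: u64, update: F) -> Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        if let Some((resp, _)) = self.responses.get(&call_id) {
            return resp.clone(); // duplicate: respond without re-executing
        }
        let resp = update();
        self.responses.insert(call_id, (resp.clone(), now));
        resp
    }

    /// ACK received: the sender got the response, so de-persist it early.
    fn receive_ack(&mut self, call_id: u128) {
        self.responses.remove(&call_id);
    }

    /// Periodic cleanup: drop responses older than 24 hours.
    fn expire(&mut self, now: u64) {
        self.responses
            .retain(|_, (_, at)| now.saturating_sub(*at) < RETENTION_SECS);
    }
}
```

The ACK is only an optimization: without it the response is still dropped by `expire` after 24 hours, at the cost of holding storage longer.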
As you can see, this algorithm is resilient to temporary canister downtime (up to 24 hours): the shard keeps sending notifications, exponentially increasing the timeout after each one, until it receives the response.
It should be possible to make this interval bigger (for example, 48 hours), but for serious incidents even 24 hours is a lot: it means many coroutines/responses will stack up in storage, which may drain it completely, triggering the Rebalancing and making the whole situation even worse.