

Intro

We want Transactions to always eventually complete, whether or not the underlying networking conditions allow it at the moment. We also want to be able to upgrade the whole database without stopping all of its shards first, to keep downtime to a minimum. Both of these problems can be solved by a custom transport layer implemented on top of the notify() API and Coroutines.

Notify-only inter-canister communications

<aside> This algorithm could be implemented much more efficiently and elegantly if the following changes were made to the System API:

  1. It should be possible to retrieve the request_id produced by the system when making an ic0::call_new inter-canister call.
  2. It should be possible to define a special #[ic_cdk::callback] lifecycle function that is executed as a fallback when a -1 callback pointer is provided to the ic0::call_new API.
  3. It should be possible to retrieve the request_id of the current message or reply (e.g. ic0::msg_request_id()).
  4. It should be possible to ic0::msg_reply to a specific request_id, even while in the context of another request or of no request at all (for example, in the heartbeat).

With these four changes (please correct this statement if it is wrong) it would be possible to completely replace futures-based canister communications with coroutine-based communications. This might be a good alternative, since coroutines can live in stable memory and survive canister upgrades while in a suspended state - without waiting for all interactions to complete. A rough sketch of what these hypothetical API additions could look like is given right after this callout.

Also, this will make it possible for coroutine-based canisters to freely interact with any other canister. This means that it would be possible to interact with, for example, the management canister from such coroutine-based canisters.

</aside>
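For illustration only, here is a rough Rust sketch of what these hypothetical System API additions might look like. None of this exists in ic0 or ic_cdk today: the RequestId type, the trait, and every name in it are assumptions that merely restate the four points from the callout above.

```rust
/// HYPOTHETICAL sketch - nothing below is part of the real System API.
/// Opaque identifier the system assigns to every inter-canister request.
pub type RequestId = [u8; 32];

pub trait ProposedSystemApi {
    /// 1. Performing an ic0::call_new / ic0::call_perform sequence would hand
    ///    back the request_id of the call that was just enqueued.
    fn call_perform_with_request_id(&mut self) -> RequestId;

    /// 2. A canister-wide fallback callback, invoked for replies to calls that
    ///    passed -1 as the reply/reject callback pointer to ic0::call_new
    ///    (in the CDK this could surface as a #[ic_cdk::callback] lifecycle hook).
    fn fallback_callback(&mut self, request_id: RequestId, payload: Vec<u8>);

    /// 3. The request_id of the message or reply currently being processed.
    fn msg_request_id(&self) -> RequestId;

    /// 4. Reply to an arbitrary request_id from any execution context -
    ///    another request, a timer, or the heartbeat.
    fn msg_reply_to(&mut self, request_id: RequestId, payload: &[u8]);
}
```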

If a canister exclusively uses notify() for all its inter-canister calls, it is safe to upgrade such a canister without stopping it first.

We achieve this by replacing call() invocations with notify() invocations, but it is not that simple: once we do that, we are no longer able to check whether the data was received on the other side, nor to receive a response. So we have to implement this logic ourselves.
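A minimal sketch of the difference, using ic_cdk (exact module paths vary between cdk versions) and a hypothetical receive_notification endpoint on the remote shard:

```rust
use candid::Principal;
use ic_cdk::api::call::{call, notify, CallResult, RejectionCode};

/// With call() we await the remote response and can inspect delivery errors.
async fn via_call(shard: Principal, args: Vec<u8>) -> CallResult<(Vec<u8>,)> {
    call(shard, "receive_notification", (args,)).await
}

/// With notify() the only thing we learn is whether the message was enqueued
/// locally. Whether it was delivered, executed, or answered is unknown, so the
/// call_id / response / ack bookkeeping described below has to be done by hand.
fn via_notify(shard: Principal, args: Vec<u8>, call_id: u64) -> Result<(), RejectionCode> {
    notify(shard, "receive_notification", (args, call_id))
}
```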

Since we’re developing a flow that will only apply to inter-canister communications within a single software unit (our database), we will assume that shards always eventually respond to incoming notifications. There are only two reasons for a remote shard to be unable to respond: either it is out of cycles, or the endpoint doesn’t exist or panics. Both of these are the responsibility of the particular database admin/developer, and we can’t completely prevent them.

In our scenario there are no malicious actors. If the remote shard’s subnet is under heavy load and unable to respond at the moment, we should keep trying to communicate with it, because at some point in time it will come back online. These subtle but important assumptions allow us to implement this new communication flow in the most conflict-free way possible:

graph LR
	subgraph Receive Ack
		direction LR
		
		Start4(("`Start`"))
		Stop4(("`Stop`"))

		Start4 --> IsPersisted2

		IsPersisted2{"`
			A response 
			with __call_id__
			exists?
		`"}

		IsPersisted2 -- Yes --> DePersist["`De-persist the response`"]
		IsPersisted2 -- No --> Stop4
		DePersist --> Stop4
	end

	subgraph Receive Notification
		direction LR
	
		Start3(("`Start`"))
		Stop3(("`Stop`"))

		Start3 --> IsPersisted

		IsPersisted{"`
			A response 
			with __call_id__
			exists?
		`"}

		IsPersisted -- No --> Exec["`Execute the __update__ method that was requested`"]
		IsPersisted -- Yes --> Respond["`Respond to the notification with the response`"]

		Respond --> Stop3

		Exec --> Persist["`Persist the result of the __update__ method for 24 hours`"]
		Persist --> Respond
	end

	subgraph On Timer_0
		direction LR
		
		Start2(("`Start`"))
		Stop2(("`Stop`"))
	
		TimerRings{"`
			__timer_0__
			rings?
		`"}

		Start2 --> TimerRings
		TimerRings -- No --> TimerRings

		WasReplied{"`
			received __response__
			for __call_id__?
		`"}

		TimerRings -- Yes --> WasReplied

		Notify1["`__notify()__ the remote canister with __args__ and __call_id__`"]
		SetTimer1["`Set __timer_0__ to N1 = N0 * 2 seconds`"]

		MaxReached{"`
			__timer_0__
			reached 24
			hours?
		`"}

		MaxReached -- Yes --> GiveUp["`Give up - log fatal error, clean up all dependent coroutines`"]
		MaxReached -- No --> Stop2

		WasReplied -- No --> Notify1
		Notify1 --> SetTimer1
		SetTimer1 --> MaxReached
		WasReplied -- Yes --> Stop2
	end	

	subgraph Notify Function
		direction LR

		Start1(("`Start`"))
		Stop1(("`Stop`"))
	
		Generate["`Generate __call_id__`"]
		Notify["`__notify()__ the remote canister with __args__ and __call_id__`"]
		Notify -.- Note
		Note(["`Remote shard invokes __receive_notification__`"])
		SetTimer["`Set __timer_0__ to N seconds`"]

		Suspend["`Suspend current coroutine until the __response__ is received`"]
		OverAndOut["`Send __ack__ for __call_id__`"]
		OverAndOut -.- Note2
		Note2(["`Remote shard invokes __receive_ack__`"])

		Received(["`The response is received`"])
		Resume(["`Resume current coroutine`"])

		Start1 --> Generate --> Notify --> SetTimer --> Suspend --> Received --> OverAndOut --> Resume --> Stop1
	end
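Below is a minimal Rust sketch of the sender side of this flow (the "Notify Function", "On Timer_0" and ack steps), assuming the ic-cdk-timers crate and hypothetical names such as PENDING, start_notify, receive_response and the remote receive_notification / receive_ack endpoints. Coroutine suspension/resumption is only hinted at in comments, and a real implementation would keep this state in stable memory so it survives upgrades:

```rust
use std::{cell::RefCell, collections::BTreeMap, time::Duration};
use candid::Principal;
use ic_cdk::api::call::notify;
use ic_cdk_timers::set_timer;

const INITIAL_BACKOFF: Duration = Duration::from_secs(10);
const MAX_BACKOFF: Duration = Duration::from_secs(24 * 60 * 60);

thread_local! {
    // call_id -> (remote shard, encoded args, current backoff). A real
    // implementation would keep this, and the suspended coroutine, in stable memory.
    static PENDING: RefCell<BTreeMap<u64, (Principal, Vec<u8>, Duration)>> =
        RefCell::new(BTreeMap::new());
    static NEXT_CALL_ID: RefCell<u64> = RefCell::new(0);
}

/// "Notify Function": generate a call_id, notify() the remote shard, arm timer_0
/// and suspend the calling coroutine until receive_response() is invoked.
fn start_notify(shard: Principal, args: Vec<u8>) -> u64 {
    let call_id = NEXT_CALL_ID.with(|id| {
        let mut id = id.borrow_mut();
        *id += 1;
        *id
    });
    // A failed enqueue is treated the same as a lost message: timer_0 re-sends it.
    let _ = notify(shard, "receive_notification", (args.clone(), call_id));
    PENDING.with(|p| p.borrow_mut().insert(call_id, (shard, args, INITIAL_BACKOFF)));
    arm_timer(call_id, INITIAL_BACKOFF);
    call_id // the coroutine suspends on this id until the response arrives
}

/// "On Timer_0": if no response has arrived yet, re-notify and double the
/// backoff; give up once the backoff has grown past 24 hours.
fn arm_timer(call_id: u64, delay: Duration) {
    set_timer(delay, move || {
        let Some((shard, args, backoff)) =
            PENDING.with(|p| p.borrow().get(&call_id).cloned())
        else {
            return; // "received response for call_id?" -> Yes
        };
        if backoff >= MAX_BACKOFF {
            // Give up: log a fatal error, clean up all dependent coroutines.
            PENDING.with(|p| p.borrow_mut().remove(&call_id));
            return;
        }
        let _ = notify(shard, "receive_notification", (args, call_id));
        let next = backoff * 2; // "Set timer_0 to N1 = N0 * 2 seconds"
        PENDING.with(|p| {
            if let Some(entry) = p.borrow_mut().get_mut(&call_id) {
                entry.2 = next;
            }
        });
        arm_timer(call_id, next);
    });
}

/// Invoked by the remote shard with the persisted response: disarm the retry
/// loop, send the ack and resume the suspended coroutine (not shown).
#[ic_cdk::update]
fn receive_response(response: Vec<u8>, call_id: u64) {
    if let Some((shard, _args, _backoff)) = PENDING.with(|p| p.borrow_mut().remove(&call_id)) {
        let _ = notify(shard, "receive_ack", (call_id,));
        // resume_coroutine(call_id, response); // hypothetical
    }
    let _ = response;
}
```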

An example of how this algorithm would work:

sequenceDiagram
	participant S1 as Shard 1
	participant S2 as Shard 2

	activate S1
		S1 ->> S2: notify() call with [call_id]
		S1 ->> S1: set timer_0 to 10 sec
	deactivate S1
	
	S2 ->> S2: unavailable or panics
	S2 ->> S2: fixed
	
	activate S1
		S1 ->> S1: timer_0 rings
		S1 ->> S2: notify() call with [call_id]
		S1 ->> S1: set timer_0 to 20 sec
	deactivate S1

	activate S2
		S2 ->> S2: process the request with [call_id]
		S2 ->> S2: persist response with [call_id] for 24 hours
		S2 ->> S1: respond() with [call_id]
	deactivate S2

	S1 ->> S1: unavailable or panics
	S1 ->> S1: fixed

	activate S1
		S1 ->> S1: timer_0 rings
		S1 ->> S2: notify() call with [call_id]
		S1 ->> S1: set timer_0 to 40 sec
	deactivate S1

	activate S2
		S2 ->> S2: restore the response with [call_id]
		S2 ->> S1: respond() with [call_id]
	deactivate S2

	activate S1
		S1 ->> S2: (optimization) send ACK with [call_id] once
	activate S2
		S2 ->> S2: (if ACK received) de-persist the response with [call_id]
	deactivate S2
		S1 ->> S1: unset timer_0
		S1 ->> S1: process the response with [call_id]
	deactivate S1
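For completeness, here is a hedged sketch of the receiving shard's side ("Receive Notification" and "Receive Ack" in the diagrams above), again with hypothetical names: RESPONSES stands in for stable-memory storage, execute_update for the actual update method being requested, and receive_response for the sender-side endpoint from the previous sketch. The periodic cleanup timer for the 24-hour TTL is not shown:

```rust
use std::{cell::RefCell, collections::BTreeMap};

/// Responses are kept for 24 hours; a periodic timer (not shown) would sweep
/// entries whose timestamp is older than this.
const RESPONSE_TTL_NANOS: u64 = 24 * 60 * 60 * 1_000_000_000;

thread_local! {
    // call_id -> (persisted response, ic0 time when it was produced)
    static RESPONSES: RefCell<BTreeMap<u64, (Vec<u8>, u64)>> = RefCell::new(BTreeMap::new());
}

/// Placeholder for whatever update method the sender actually requested.
fn execute_update(args: &[u8]) -> Vec<u8> {
    args.to_vec()
}

/// "Receive Notification": execute at most once per call_id, persist the result
/// and answer with a notify() back to the caller's receive_response endpoint.
#[ic_cdk::update]
fn receive_notification(args: Vec<u8>, call_id: u64) {
    let caller = ic_cdk::caller();
    let response = RESPONSES.with(|r| {
        r.borrow_mut()
            .entry(call_id)
            .or_insert_with(|| (execute_update(&args), ic_cdk::api::time()))
            .0
            .clone()
    });
    let _ = ic_cdk::api::call::notify(caller, "receive_response", (response, call_id));
}

/// "Receive Ack": the sender confirmed it got the response, so it can be
/// de-persisted right away instead of waiting out the full 24 hours.
#[ic_cdk::update]
fn receive_ack(call_id: u64) {
    RESPONSES.with(|r| r.borrow_mut().remove(&call_id));
}
```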

As the diagrams show, this algorithm is resilient to temporary canister downtime (up to 24 hours) - the shard will keep re-sending the notification, exponentially increasing the timeout after each attempt, until it receives the response. This will help with a lot of things:

It should be possible to make this interval even bigger (for example, 48 hours), but in some serious situations even 24 hours is a lot, since it means that a lot of coroutines/responses will stack up in storage, which may drain it completely, triggering the Rebalancing and making the whole situation even worse.
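As a sanity check on the retry budget (assuming the 10-second initial interval from the sequence diagram above): with doubling after every attempt, the interval itself exceeds 24 hours after roughly 14 re-notifications, so even a worst-case call produces only a handful of extra messages before the algorithm gives up.

```rust
/// Counts how many doublings it takes for the retry interval to exceed the cap.
fn retries_until_cap(initial_secs: u64, cap_secs: u64) -> u32 {
    let mut interval = initial_secs;
    let mut retries = 0;
    while interval < cap_secs {
        interval *= 2;
        retries += 1;
    }
    retries
}

fn main() {
    // 10 s, 20 s, 40 s, ... passes 24 hours (86_400 s) on the 14th doubling.
    assert_eq!(retries_until_cap(10, 24 * 60 * 60), 14);
}
```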