The End-To-End Argument

Akash Garg
4 min read · Feb 11, 2021

Most modern applications use the cloud in some form or another. In fact, the cloud is now considered an important part of delivering a good user experience. Quite often the cloud hosts APIs that clients call for various features. The cloud component consists of multiple microservices, which provide different functionalities to clients and also connect to each other to exchange data. It is well known that in the majority of cases the inter-service data volume is much higher than the volume of data served to the clients.

While connecting one such microservice to another, I came across an interesting question: why do microservices respond to each other even when they have nothing new to say? The response matters when some processing is requested or the microservice has data to send back, but a lot of such APIs respond with an empty 200 OK. Consider a simple file transfer from a service X to another service Y. X wants to send a file to Y, so it connects to Y. Since the file contains important data that has to be transferred reliably, X uses TCP and hands the file data over to it. TCP establishes a connection with Y and reliably sends the whole file in pieces. It also ensures that the data arrives in the proper order, and checksums ensure that the data is not corrupted. So we have reliably sent a file from X to Y. At this point X also knows the transfer was successful; otherwise TCP on X's side would have raised an error. Yet in most applications developed today, Y now responds to X saying the transfer was successful. This is what got me thinking: why exactly is Y responding?
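The pattern can be sketched in a few lines. This is a toy version of the hypothetical services X and Y above (ports, message framing, and the "200 OK" string are illustrative, not any real API): X hands the file to TCP, TCP delivers it reliably, and yet Y still sends an application-level acknowledgement that X waits for.

```python
import socket
import struct
import threading

def run_y(server_sock):
    """Service Y: receive a length-prefixed blob over TCP, then acknowledge."""
    conn, _ = server_sock.accept()
    with conn:
        size = struct.unpack("!I", conn.recv(4))[0]
        data = b""
        while len(data) < size:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
        # TCP has already acknowledged every segment back to X's kernel,
        # yet Y still replies at the application layer.
        conn.sendall(b"200 OK")

def x_sends_file(payload, port):
    """Service X: send the file over TCP, then wait for Y's response."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(struct.pack("!I", len(payload)) + payload)
        # sendall() returning without error means TCP accepted the data;
        # X nevertheless waits for Y's empty-bodied acknowledgement.
        return sock.recv(16)

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=run_y, args=(server,), daemon=True).start()

response = x_sends_file(b"important data" * 1000, port)
print(response)
```

The question the rest of this post answers is why that final `sendall(b"200 OK")` exists at all, given everything TCP has already done.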

After pondering over this question for some time, I came across another situation which, although it initially seemed unrelated, is closely linked to it. I read a protocol specification for establishing mesh networks in IoT-enabled homes. A common feature of many such protocols is that they provide secure communication from one node to another. Having a bit of background in security, I read all the details of the security mechanisms used, and they were fairly impressive. It was state of the art. I discussed with a senior colleague how this could be used, and one of his comments was surprising: he said the first thing we would need to do to use it is develop a security mechanism. So the protocol provides secure communication, yet another layer of security is needed on top of it. This new layer cannot be more secure than the lower layer, since the lower layer is already top notch. Why do we have another layer of security?

The two questions above point to the same thing. In the first case, TCP already provides reliable communication from X to Y, with acknowledgements for each data packet sent, but we still require Y to send a 200 OK response saying the transfer was successful. In the second case, even though a lower network layer provides secure communication, we still add another layer on top.

Once this connection was clear, I was able to find a whole range of situations that are essentially the same. A very common example is the SHA-256 hash published alongside package downloads on websites. The lower networking layers already use checksums to ensure correctness, yet another checksum is shown on the very web page that hosts the download link. Another example is FIFO message delivery.
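The download-checksum case is easy to make concrete. In this sketch (the file contents and the published digest are made up for illustration), the TCP/IP stack has already checksummed every packet in transit, yet the user still verifies the whole file end to end against the digest from the web page:

```python
import hashlib

def verify_download(data: bytes, published_sha256: str) -> bool:
    """Compare the file's actual digest against the one on the web page."""
    return hashlib.sha256(data).hexdigest() == published_sha256

data = b"pretend this is the downloaded package"
digest = hashlib.sha256(data).hexdigest()  # the value the page would publish

print(verify_download(data, digest))         # True: download is intact
print(verify_download(data + b"!", digest))  # False: something changed
```

Note what the end-to-end check catches that per-packet checksums cannot: corruption on the server's disk, in a proxy cache, or anywhere outside the network path.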

All these properties, i.e. reliability, correctness, security, and FIFO message delivery, are already provided by lower networking layers. Yet they are essentially re-implemented at the application layer. Take the example of reliability, since it is the most familiar. In most networking courses we are taught that TCP means reliable, in-order delivery. As the discussion above shows, that is not quite the case: TCP alone does not guarantee end-to-end reliability. Let us do a thought experiment. What happens to correctness if we use UDP instead of TCP? UDP does not care about reliable delivery; it drops packets fairly frequently. Shouldn't correctness at the application layer suffer if only half of the data arrives? The answer is no. That is because reliability is not ultimately ensured by TCP; rather, it is the responsibility of the application layer. The application ensures it by waiting for the (empty) 200 OK response, which indicates that the other end has received the data correctly. If an error happens (say, a packet is dropped and never recovered), the other application responds with an error instead.

If the application has to provide reliability, then what does TCP actually do? TCP may not guarantee end-to-end reliability, but it must provide something; if it provided nothing, it would not be used. It provides a very important aspect of communication: it reduces the cost of reliability. Imagine UDP being used to send a file and a single packet being dropped or corrupted somewhere in the network. That is very much possible, as the underlying network is unreliable. The only option available to the application is to resend the whole file. With good networking equipment and low load such failures may be rare, but they are not zero. After sending a 20 MB file, X discovers that a single packet was dropped somewhere, and now the whole file has to be sent again. That is a very high cost. TCP reduces this cost: it retransmits only the single packet that was dropped rather than sending the whole file again.
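A back-of-the-envelope calculation makes the savings vivid. The numbers here are illustrative assumptions (20 MB file, 1500-byte packets, 0.1% per-packet loss), and the "UDP-style" figure is a first-order estimate that ignores the resend itself failing:

```python
FILE_SIZE = 20 * 1024 * 1024   # 20 MB file, as in the example above
PACKET = 1500                  # bytes per packet (a typical MTU)
P_LOSS = 0.001                 # assumed 0.1% per-packet loss rate

packets = FILE_SIZE // PACKET + 1
p_any_loss = 1 - (1 - P_LOSS) ** packets  # chance at least one packet is lost

# Whole-file recovery: any single loss forces a full resend.
full_resend_cost = p_any_loss * FILE_SIZE

# TCP-style recovery: only the expected number of lost packets is resent.
tcp_resend_cost = packets * P_LOSS * PACKET

print(f"packets per file:          {packets}")
print(f"P(at least one loss):      {p_any_loss:.3f}")
print(f"expected resend, full:     {full_resend_cost / 1e6:.1f} MB")
print(f"expected resend, TCP-like: {tcp_resend_cost / 1e3:.1f} KB")
```

With roughly fourteen thousand packets per file, losing at least one is nearly certain, so the whole-file strategy pays almost the full 20 MB again on average, while per-packet retransmission pays on the order of kilobytes.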

This pattern repeats in all the cases above and many more. It is the responsibility of the application layer to ensure security, correctness, reliability, ordered delivery, and other such properties. The lower networking layers only reduce the cost of achieving them; they do not fundamentally guarantee them.

This is the end-to-end argument: the essential properties of communication are ultimately implemented in the application layer (the top layer), and the lower layers only help reduce the cost of implementing them.
