Date: 2024-10-05

Cloud storage has become a bit of a nebulous term nowadays. You might initially think of cloud storage as Google Drive or OneDrive, but it's far more expansive. Most websites are probably served, at least in part, by cloud storage of some kind, whether on AWS S3, a basic web server somewhere serving a static file, or a platform like Google Drive. All of these solutions have one thing in common though: they're not storing data 'in' the internet, they're storing it 'on' the internet, at the addresses of servers which themselves hold the information and, in theory, could disappear from the internet at any time. So, would it be possible to store data 'in' the internet, for free?

More data is stored in flight than you might think

You might be surprised to learn that this has been suggested and done before. Storing data in flight isn't a new phenomenon. By storing data in flight, i.e. while it is being transmitted from one system to another, you avoid any one system needing to hold all of the data at once, reducing your storage costs in the process. Message queues and streaming platforms like Apache Kafka or RabbitMQ, common technologies in software architecture, provide large, highly scalable systems that allow messages to be sent across distributed systems. Typically, while there is some element of persistent data storage, these systems are designed with real-time processing in mind rather than as long-term data stores.

High-performance and real-time systems are similar, with examples in the worlds of high-frequency trading, real-time data processing, and content delivery networks. Many of these systems trade an element of persistence for performance, relying on some (or all) of their data never being saved to any form of persistent storage. By doing this, they avoid the bottlenecks that databases can encounter: battling race conditions and write locks, and struggling to maintain consistency while dealing with a physical bottleneck such as a single file system.

How can we test this out?

There are a couple of ways to test this out, but arguably the coolest is to use ICMP. ICMP (Internet Control Message Protocol) is the protocol behind the ping utility. It carries control and diagnostic messages between hosts, and its echo messages are mainly used to check whether connectivity between two hosts can be established when diagnosing network issues. A 'ping' is a small message sent to another host to confirm its availability, with the second host sending a small response.

This is where the magic begins — ICMP supports a data payload, which is returned to the original host in the ICMP response. This response takes a period of time — hence the colloquial use of 'ping' for latency, especially among gamers. During this time, your data is not stored on your file system directly, but 'in the internet' as it travels from one host to another, and then back again. If you do this with many hosts, and many, many ping packets, it's possible to generate an amount of extremely inconsistent data storage on the internet.

What's going on under the hood here is an ICMP Echo Request (one of the message types in ICMP) followed by an ICMP Echo Reply. The request is essentially one device asking another "Are you there?". The other device answers with an Echo Reply, which carries back the data payload of the original request.
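To make the packet layout concrete, here is a minimal Python sketch of how an Echo Request with a data payload could be packed and unpacked. It follows the header layout in RFC 792 and the checksum in RFC 1071; the function names (`internet_checksum`, `build_echo_request`, `parse_echo`) are made up for illustration, and nothing is sent over the network.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type 8 = Echo Request (RFC 792)
ICMP_ECHO_REPLY = 0    # ICMP type 0 = Echo Reply


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes) -> bytes:
    """Pack type, code, checksum, identifier and sequence, followed by the payload."""
    # The checksum is computed over the whole message with the checksum field zeroed.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


def parse_echo(packet: bytes):
    """Split an ICMP echo message (without its IP header) into its fields."""
    icmp_type, _code, _checksum, identifier, sequence = struct.unpack("!BBHHH", packet[:8])
    return icmp_type, identifier, sequence, packet[8:]


packet = build_echo_request(identifier=0x1234, sequence=1, payload=b"hello")
```

A reply uses the same layout with the type set to 0, so the same `parse_echo` helper recovers the payload on the way back. A handy sanity check is that recomputing the checksum over a complete, valid message gives zero.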

When we say 'on the internet', what we really mean is in the buffers of various network devices across the internet. This 'buffer bloat' is a cost for providers, ISPs and servers, so there's an antisocial aspect to using this idea on any devices you don't control.

Your data might not come back in one piece

There are obviously some problems with this. ICMP runs directly on top of IP and, like UDP (User Datagram Protocol), is connectionless, which means that you have no guarantee that a packet will be delivered at all. On a typical Ethernet link with a 1500-byte MTU, a single unfragmented ICMP packet can also only carry ~1500 bytes (1472 bytes of payload once the IPv4 and ICMP headers are subtracted), which means that to generate a gigabyte of cloud storage you'd need to keep ~715,000 echo requests in flight at any one time. This is close to impossible: your PC, router, and switching hardware will choke long before you manage that many outstanding requests, and that's assuming you've both avoided being throttled by your ISP and found enough unique servers responding to pings (probably the easy part).
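The arithmetic behind that figure is easy to check. The sketch below assumes a gigabyte means 2^30 bytes, a 1500-byte Ethernet MTU, a 20-byte IPv4 header without options and the 8-byte ICMP header; the helper name `packets_needed` is just for illustration.

```python
GIB = 2**30          # one gigabyte, taken here as 2^30 bytes (assumption)
MTU = 1500           # typical Ethernet MTU in bytes (assumption)
IPV4_HEADER = 20     # IPv4 header without options
ICMP_HEADER = 8      # type, code, checksum, identifier, sequence


def packets_needed(total_bytes: int, payload_bytes: int) -> int:
    """Number of packets required to hold total_bytes, rounded up."""
    return -(-total_bytes // payload_bytes)  # ceiling division on integers


rough = packets_needed(GIB, MTU)                               # whole MTU as payload
exact = packets_needed(GIB, MTU - IPV4_HEADER - ICMP_HEADER)   # 1472-byte payload
print(rough, exact)  # 715828 729445
```

Subtracting the headers pushes the count up slightly, but it stays in the same ballpark of roughly three quarters of a million packets in flight at once.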

As ICMP offers no delivery guarantees, you may never see that data again: if you choke your network or are throttled, for example, it is lost to the ether. Hence, you'll quickly find that any implementation of this kind of system will shred your data, and fast.
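One way to at least notice the shredding is to number every chunk and check which numbers never came back. The following is a rough sketch with hypothetical helper names, with the loss simulated by deleting a chunk rather than by any real network traffic. It assumes a 1472-byte payload per packet; note that the ICMP sequence field is only 16 bits wide, so a real implementation would need a larger identifier inside the payload.

```python
def split_into_chunks(data: bytes, chunk_size: int = 1472) -> dict:
    """Map a sequence number to each payload-sized slice of data."""
    return {
        seq: data[start:start + chunk_size]
        for seq, start in enumerate(range(0, len(data), chunk_size))
    }


def missing_chunks(sent_count: int, received: dict) -> list:
    """Sequence numbers that were sent but never came back."""
    return [seq for seq in range(sent_count) if seq not in received]


def reassemble(sent_count: int, received: dict) -> bytes:
    """Rebuild the original bytes, refusing to return silently damaged data."""
    missing = missing_chunks(sent_count, received)
    if missing:
        raise ValueError(f"lost chunks: {missing}")
    return b"".join(received[seq] for seq in range(sent_count))


data = bytes(range(256)) * 20          # 5120 bytes of sample data
sent = split_into_chunks(data)         # 4 chunks at 1472 bytes each
received = dict(sent)
del received[2]                        # simulate one reply never arriving
print(missing_chunks(len(sent), received))  # [2]
```

Detection is the easy half. Recovering the lost chunk would need redundancy, such as keeping a local copy or sending each chunk to several hosts, which eats into the storage you were trying to gain.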

High ping times will lend a hand

There is one weird quirk with this system of storing data, however. The amount of data you can store in the 'cloud' at any given time is largely dependent on two things: how many echo requests you can keep outstanding, and how long each one takes to come back. There must be some ratio between the data currently on your PC, having been received and awaiting retransmission, and the data currently in transit. By minimizing the amount of data on the PC at any one time, either by handling retransmissions faster or by increasing the time that data spends in the air, we increase the ratio of effective 'in-flight' storage and help to reduce the burden on our original machine.

[Figure: a longer time in flight versus a shorter time in flight]

In this way, the ideal set of servers to send your ping requests to would be those with the longest possible response times, as that allows you to store more data in flight and reduces the load on your network.
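This is essentially the bandwidth-delay product: the data in transit is roughly the send rate multiplied by the round-trip time. A small sketch, using made-up numbers (1,000 packets per second, 1472-byte payloads, and round trips of 20 ms versus 300 ms) purely for illustration:

```python
def bytes_in_flight(packets_per_second: int, payload_bytes: int, rtt_ms: int) -> int:
    """Approximate data in transit: send rate multiplied by round-trip time."""
    return packets_per_second * payload_bytes * rtt_ms // 1000


# Same send rate, different round-trip times (illustrative values only).
nearby = bytes_in_flight(packets_per_second=1000, payload_bytes=1472, rtt_ms=20)
distant = bytes_in_flight(packets_per_second=1000, payload_bytes=1472, rtt_ms=300)
print(nearby, distant)  # 29440 441600
```

With the send rate held constant, a round trip that is 15 times longer keeps 15 times as much data in the air, without the sending machine doing any extra work.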

Has anyone actually implemented this?


This idea has been around for a while, and there is a working implementation. PingFS is a Linux-based implementation of a complete file system relying on this idea. Building it and getting it running proved difficult, however, so I wasn't able to test it out fully. It is fairly trivial to validate the concept in Python though, and I've written up an example that's a bit easier to play with, which is available on GitHub. This demonstrates the ability of ICMP Echo Request and Reply packets to store data, with the script reproducing the original string from the received reply. Doing this at scale with a file system is a vastly more complex endeavor though, and while fun to play with, this example basically demonstrates what we already expect to happen within the ICMP standard.
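For a feel of what such a validation might look like, here is a rough sketch of a single echo round trip using Python's standard `socket` module. It is not the script from the repository mentioned here, just one possible approach: the function names are hypothetical, it assumes IPv4, and opening a raw ICMP socket normally requires root or administrator privileges.

```python
import os
import socket
import struct


def checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def strip_headers(datagram: bytes):
    """Return (ICMP type, identifier, payload) from a raw IPv4 datagram."""
    ip_header_len = (datagram[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    icmp = datagram[ip_header_len:]
    icmp_type, _code, _csum, identifier, _seq = struct.unpack("!BBHHH", icmp[:8])
    return icmp_type, identifier, icmp[8:]


def echo_round_trip(host: str, payload: bytes, timeout: float = 2.0) -> bytes:
    """Send one Echo Request and return the payload carried by the matching reply."""
    ident = os.getpid() & 0xFFFF
    header = struct.pack("!BBHHH", 8, 0, 0, ident, 1)
    header = struct.pack("!BBHHH", 8, 0, checksum(header + payload), ident, 1)
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP) as sock:
        sock.settimeout(timeout)  # applies to each receive call, not the whole loop
        sock.sendto(header + payload, (host, 0))
        while True:
            datagram, _addr = sock.recvfrom(65535)
            icmp_type, identifier, data = strip_headers(datagram)
            if icmp_type == 0 and identifier == ident:  # type 0 = Echo Reply
                return data
```

Run with sufficient privileges, `echo_round_trip("127.0.0.1", b"hello")` should hand back the same bytes that were sent, assuming the host answers pings. If no reply arrives within the timeout, `recvfrom` raises `socket.timeout`, which is exactly the lost-data case.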

One thing is for sure: expect to lose a lot of your data doing this. Even a basic attempt to ramp up my script to manage chunks of data from a large data source quickly brought my laptop and LAN to their knees.

A cool, though highly impractical concept

While there are systems all around us that effectively store data in transmission, or at least in a transient format while it is being transmitted, this is a pretty terrible way to store your data. Not only is it heavily antisocial towards the wider internet, offloading your data storage needs into the buffers and memory of systems run by other users and your ISP, but it is also incredibly resource-heavy, requiring a massive amount of switching and packet processing per bit of data stored (not to mention choking your laptop).

It is, however, a cool concept for offloading storage from your machine, and a good way to understand a little more about how the internet works. Consider briefly the volume of data 'in-flight' at any one moment. I couldn't think of a good way to ballpark it, but I'd suspect it's a very large number indeed.