I’ve been wrestling with the issue of “data storage that syncs and works offline” for a number of years now, and have gotten 80% of the way on a few different implementations, but always ultimately get stymied by the little things, like “how do I make it not break”.
And so I’ve once again started to create a data-storage & syncing solution, and as part of that I’ve been looking into the other options out there. I’m working on several posts evaluating various solutions, but I’ll start by outlining the criteria I’m using to evaluate them.
To start, by my definition of “local-first”, a data-storage solution must meet three requirements:
My app’s functionality must be network-optional; accessing & updating my data must work offline, on multiple devices (with syncing that ideally doesn’t require manual conflict resolution), for extended periods of time, and across browser / device restarts. (In short, we need client-side persistence & full data replication.)
My data must be available. If I make a change (or create a document) on one device that’s connected to the internet, I should be able to go to another device and access that data, independent of whether the first device happens to be connected at the time. (We need server-side persistence & full data replication.)
It’s fine with me if the “server-side persistence” is just “there’s a server in the cloud that is also a peer just like everyone else”.
There must be an open-source server implementation. If your fun startup goes out of business, I don’t want to be left high and dry.
When I went to look at the various solutions out there, it was surprising how few met even those simple criteria.
There are a number of closed-source solutions that claim to meet the first two criteria, but who knows when their “incredible journey” will end 🦄. And then there are several open-source solutions that fail to actually deliver on one or the other of my technical requirements. For example, WatermelonDB looks very promising, and claims to be able to sync, but doesn’t provide a server implementation. Swarm has some great demos, but the server is only available as a docker container, and I’ve been unable to track down any information (or source code) for it 🤔. Unfortunately, this field is strewn with “just enough to make HN excited” prototypes that don’t appear to have made it to “actually used in production”.
For the few projects that do make it past the baseline requirements, I’ve got a much longer list of criteria with which to evaluate them. They fall into the categories of Correctness, Cost, and Flexibility.
How are conflicts handled? Does it require the client (programmer) to write bespoke conflict resolution code?
How “bulletproof” is it? How easy is it to get it into a broken state? (e.g. where different clients continue to see inconsistent data despite syncing)
Is there consistency verification built-in, to detect if you’ve somehow gotten out of sync?
How well does sync preserve intent? In what cases would a user’s work be “lost” unexpectedly, or a change that was made “later” get overridden by a change made before? Can a sync result in data that doesn’t make logical sense / is outside of the app’s logical schema?
How much data does the client need to store to fully replicate (full offline data access & editing)? Hopefully O(size of the data) and not O(size of the data + number of changes), with some reasonable constant factor. (this has huge impacts on the “initial app load” time)
How much data does the server need to store? If it needs to store a full change history, does it support periodic garbage collection / compaction?
What are the transfer costs of the sync protocol? E.g. are you sending the whole dataset with each change, or just deltas?
How’s the code quality, maintenance level, test coverage, etc.?
How does it react to schema changes? If you need to add an attribute to an object, can you?
Is the shape of data restricted to anything less than full JSON? E.g. are nested objects and arrays supported?
Can it be integrated into an existing (server-side or client-side) database (SQLite, Postgres, etc.) or do you have to use their specific database?
Can it sync with Google Drive, Dropbox, etc. such that each user manages (and pays for) their own backend storage?
Does it require all data to live in memory, or can it work with mostly-persisted data? (such that large datasets are usable)
Does it support e2e encryption? (zero-knowledge server persistence)
Is multi-user collaboration possible?
Is collaborative text editing supported? (I’m fine paying more for it, in terms of server requirements, data overhead, etc.)
Does it have the concept of “undo” built-in? At what cost?
Does it support a fully p2p network setup (no central authority / server)?
How well does it handle offline behavior?
Does it correctly handle working on multiple tabs in the same browser session?
Does it bake in auth, or can you use an existing authentication setup?
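To make the “consistency verification” question above a bit more concrete, here’s a minimal sketch of the kind of check I have in mind: each replica computes a checksum over a canonical serialization of its data, and two replicas that have finished syncing compare checksums. All of the names here (`canonicalize`, `fnv1a`, `stateChecksum`) are hypothetical and not taken from any of the libraries discussed in this post, and the hash is a simple FNV-1a-style function chosen to keep the example dependency-free; a real implementation would probably want a cryptographic hash.

```typescript
// Hypothetical helpers for illustration only; not any library's actual API.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so two replicas holding the same data
// produce the same string regardless of key insertion order.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  return (
    "{" +
    keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") +
    "}"
  );
}

// 32-bit FNV-1a-style hash over UTF-16 code units (an assumption made to
// avoid imports; swap in a stronger hash for real use).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Checksum of a replica's full state; equal states give equal checksums.
function stateChecksum(state: Json): number {
  return fnv1a(canonicalize(state));
}
```

After a sync round, a client could send `stateChecksum(localState)` to the server (or a peer) and flag a problem if the values differ. Note that this only detects divergence; it doesn’t say where the replicas disagree or how to repair it.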
remoteStorage.js - https://jaredforsyth.com/posts/local-first-database-remotestorage/
rxdb + pouchdb - https://jaredforsyth.com/posts/local-first-database-rxdb-pouchdb/
hypermerge + automerge - https://jaredforsyth.com/posts/local-first-database-hypermerge/
irmin - I can’t find any examples demonstrating how to have a web-client and a server-backend syncing together.
sharedb - doesn’t look like it does client-side persistence
swarm - can’t find source code or information on the server
yjs - tried following the tutorial, looks like various information has gotten out of date such that I was unable to get it running 😢. The CRDT impl looks really cool, but I couldn’t get server-side persistence working.
orbitdb - backed by IPFS, and it looks like “server-side persistence” is still an area of active research 😕
I’ll keep this post updated with the results as I evaluate various solutions. And if you know of a project I haven’t covered yet (or you can help me figure out one that I rejected), let me know on twitter 👋.