This solution instead seems to rely on using 1 MB blocks and storing those directly as objects, avoiding the intermediate caching and indirection layer. A larger number of objects, but less local overhead.
Delphix's rationale for 16 kB blocks was that their primary use case was PostgreSQL database storage. I presume this is geared toward other workloads.
And, importantly since we're on HN: Delphix's user-space service was written in Rust, as I recall it; this one uses Go.
Why would I use ZFS for this? Isn't the power of ZFS that it's a filesystem with checksums and stuff like encryption?
Why would I use it for S3?
`zfs share` already implements SMB and NFS.
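For reference, the built-in sharing is just dataset properties; a quick sketch (the dataset name is a placeholder):

    # export a dataset over NFS and SMB using ZFS's share properties
    zfs set sharenfs=on tank/projects
    zfs set sharesmb=on tank/projects
    zfs share -a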
Not sure what the use case is, out of my ignorance, but I guess one can use it to `zfs send` backups to S3 in a very neat manner.
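A minimal sketch of that idea, assuming `tank` is the local pool and `s3pool` is a pool whose vdevs are backed by this project's S3 objects (both names are placeholders):

    # snapshot locally, then replicate into the S3-backed pool
    zfs snapshot -r tank@nightly
    zfs send -R tank@nightly | zfs recv -Fu s3pool/backups/tank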
Saves me the step of creating an instance with EBS volumes and snapshotting those to S3, or whatever.
Haven't done the math at all on whether that's cost-effective, but that's the use case that comes to mind immediately.
You have it the wrong way around. Here, ZFS uses many small S3 objects as the storage substrate, rather than physical disks. The value proposition is that this should definitely be cheaper than EBS, and perhaps more durable.
See s3backer, a FUSE implementation of something similar: https://github.com/archiecobbs/s3backer
See the prior in-kernel ZFS work by Delphix, which AFAIK was closed by Delphix management: https://www.youtube.com/watch?v=opW9KhjOQ3Q
BTW, this appears to be closed too!
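Roughly how the s3backer approach looks, for comparison; treat the exact flags and the default backing-file name as assumptions from memory and check `s3backer --help` (bucket, sizes, and paths are placeholders):

    # expose a bucket as one large file of fixed-size blocks via FUSE
    s3backer --blockSize=1M --size=100G my-bucket /mnt/s3b
    # build a pool on top of that file-backed "device"
    zpool create s3pool /mnt/s3b/file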
EBS limitations:
- Per-instance throughput caps
- Pay for full provisioned capacity whether filled or not
S3:
- Pay only for what you store
- No per-instance bandwidth limits, as long as you have a network-optimized instance

Edit: Oops, `zpool create global-pool mirror /dev/nbd0 /dev/nbd1` is a better example for that. If it's not that, I'm not sure what that first example is doing.
I can see it making sense with S3-compatible stores that aren't actual S3, though.
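A sketch of what that mirrored setup could look like, assuming two NBD devices already exported by separate object-store-backed backends (device and pool names are placeholders):

    # mirror two object-backed NBD devices so either backend can fail
    zpool create global-pool mirror /dev/nbd0 /dev/nbd1
    zpool status global-pool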
I expect this becomes most interesting with L2ARC (cache) and ZIL/SLOG (log) devices to hold the working set and hide write latency. It might require tuning or changes to allow 1M writes to use the cache device.
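A sketch of attaching those devices to an existing pool; pool and device names are placeholders:

    # L2ARC read cache to hold the working set
    zpool add tank cache /dev/nvme0n1
    # SLOG (separate log device) to absorb sync writes and hide object-store latency
    zpool add tank log /dev/nvme1n1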
LSMs are “for small burst workloads kept in memory”? That’s just incorrect. “Once compaction hits all bets are off” suggests a misunderstanding of what compaction is for.
“Playing with fire,” “not sure about crash consistency,” “all bets are off”
Based on what exactly? ZeroFS has well-defined durability semantics, with guarantees much stronger than those of local block devices. If there's a specific correctness issue, name it.
“ZFS special vdev + ZIL is much safer”
Safer how?
The secret is that ZFS actually implements an object storage layer (the DMU) on top of block devices, and only then implements ZVOL and ZPL (the ZFS POSIX Layer) on top of that.
A `zfs send` is essentially a serialized stream of objects sorted by dependency (objects later in the stream may refer to objects earlier in the stream, but not the other way around).
Not only backup but also DR site recovery.
The workflow:
1. Server A (production): zpool on local NVMe/SSD/HD
2. Server B (same data center): another zpool backed by objbacker.io → remote object storage (Wasabi, S3, GCS)
3. zfs send from A to B - data lands in object storage
Key advantage: no continuously running cloud VM. You're just paying for object storage (cheap) not compute (expensive). Server B is in your own data center - it can be a VM too.
For DR, when you need the data in the cloud:
- Spin up a MayaNAS VM only when needed
- Import the objbacker-backed pool - the data is already there
- Use it, then shut down the VM

The send from A to B (step 3 above) looks like:

    zfs send -R localpool@[snapshot] | zfs recv -F objbackerpool
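A minimal sketch of the DR side with standard zpool/zfs commands, using the pool name from the example above (how the objbacker device itself gets attached to the VM is outside this sketch):

    # on the freshly spun-up VM
    zpool import objbackerpool
    zfs list -r objbackerpool
    # ...use the data, then before shutting the VM down:
    zpool export objbackerpool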
Is there a particular reason why you'd want the objbacker pool to be a separate server?

The main issue with opening it up further is the lack of a DMU-level userland API, especially given how syscall-heavy it could get (and io_uring might be locked out due to politics).
In theory it should be a pretty good match, considering that internally ZFS is an object store.