• alexpotato 2 days ago |
    I have always loved SQLite.

    I have also heard that some firms ban its use.

    Why?

    Because it makes it SO easy to set up a database for your app that you end up with a super critical component of your application that looks exactly like a file. A file that can have any extension. And that file can be copied around to other servers. Even if there is PII in that file. Multiply this times the number of applications in your firm and you can see how this could get a little nuts.

    DevOps and DBA teams would prefer that the database be a big, heavy iron thing that is very obviously a database server. And when you connect to it, that's also very obvious etc etc.

    I still love SQLite though.

    • ai_slop_hater 2 days ago |
      That's so dumb
    • Fwirt 2 days ago |
      The question is, do the same firms ban Excel? Excel spreadsheets often end up as shadow databases in unlikely places.
      • hermitShell 2 days ago |
        The sane thing would be to ban Excel and promote SQLite. Excel is often used for tabulated text (issue tracking), not calculations. That's a perfect use case for a relational db.
        • frollogaston 2 days ago |
          Excel is made for calculations. But if you make it hard to make a DB, people will abuse Excel as a DB.
          • TJSomething 2 days ago |
            I mean, it might have been at first, but Microsoft figured out that the majority of users use it for lists without formulas back in 1993, and they've strategized around that. IMHO, the biggest concession to this was when they added Power Query to core Excel in 2016.
        • rswail a day ago |
          Excel has sheets for tables, columns and rows, primary keys (UNIQUE), foreign key references etc if you squint.

          It doesn't require you use all of that properly, but it's there.

        • 0123456789ABCDE a day ago |
          and excel has gui for forms
          • rantingdemon a day ago |
            Only where VBA is available. Not available in the macOS versions, if I'm correct?
        • harvie a day ago |
          or reimplement excel with sqlite as a backend :-D

          BTW sqlite can run SQL queries on CSV files with relatively simple one-liner command...
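          For instance, something like `sqlite3 :memory: '.import --csv data.csv t' 'SELECT count(*) FROM t'` on the command line (a sketch; the filename is made up, and `.import --csv` assumes a reasonably recent sqlite3 shell). The same idea from Python's stdlib:

```python
import csv, io, sqlite3

# Stand-in for open("data.csv"): a header row plus two data rows
csv_text = "name,qty\nwidget,3\ngadget,5\n"
rows = list(csv.reader(io.StringIO(csv_text)))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, qty INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows[1:])  # skip header row

total = con.execute("SELECT SUM(qty) FROM t").fetchone()[0]
print(total)  # 8
```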

        • euroderf a day ago |
          Well heck can't someone make an SQLite extension that is basically just a simplified Excel ?
      • Spooky23 2 days ago |
        They generally cannot. But they do banish Access.
        • pasc1878 a day ago |
          Now that is different.

          Access gets used for a shared DB, and that is quite easy to corrupt. It is much more cost effective to have that in a proper central database (I suppose SQLite is better here as well).

          • cwillu a day ago |
            Excel is also a shared DB: it has supported multiple concurrent users accessing and modifying the same spreadsheet for decades.
      • DeathArrow 2 days ago |
        Do companies ban text files? Text files are used to store data.
        • altmanaltman a day ago |
          Do companies ban brains? Brains are used to store data.
          • shermantanktop a day ago |
            Brains do not exhibit ACID properties…
        • yard2010 a day ago |
          Do companies ban data centers? It's crazy to send PII to other computers on the line.
        • perching_aix a day ago |
          That's why you store them on unsaved tabs instead.
      • silon42 2 days ago |
        IMO, almost any Excel more than a month old should become readonly.
        • irishcoffee a day ago |
          You should consider knock-on effects of this brilliant idea. Now there would be copies of spreadsheets younger than a month that get replicated 47 billion times, exponentially compounding the problem you're trying to solve.

          This sounds like how we pass so many stupid laws. Nobody thinks about 2nd order effects.

          • perching_aix a day ago |
            So you're saying they should further auto-delete after two or three months?
            • irishcoffee a day ago |
              3rd order effect, people copy and paste the old sheet into a new sheet, now we have worse exponential. You’re not very good at this huh.
              • perching_aix a day ago |
                Which is very annoying and people will complain. People complaining can be then directed towards a better solution. As a bonus, mistakes will also rise, leading to further complaints, especially ones that reach higher. All this making the dogshit practice, and the idiots committing them, infinitely more visible and thus fixable.

                The sheer volume of data that needs tending to may even grind certain departments to a halt! What a great opportunity! It'd appear I'm positively stellar at this!

                • irishcoffee a day ago |
                  Sorry for the snark, that was shitty of me.
                  • perching_aix a day ago |
                    No worries, was a bit of a gamble of a joke from me (sarcasm frequently doesn't translate in text, or can be inopportune), so I tried taking it accordingly.

                    For clarity, while I did have some rather perverse fun toying with the idea, I do not actually think it should be implemented, or at least certainly not in one fell swoop and as-is. Mostly for the aforelisted reasons. This is what I actually intended to convey.

                    Though for better or for worse, that doesn't mean I think the notion is completely meritless either, so I might still be deserving of at least some of your snark. But a lot of it was in jest from my side indeed.

                    This whole keeping an inventory, disabling items, and later taking them out completely is a chore I already do in other contexts. While it does work, it is anxiety inducing, doesn't really scale well, and it's quite miserable to go through with. It's the cost of keeping things organized as far as I'm concerned, with no real way around it in the general case. The best one can do is try and mitigate it, by monitoring for patterns and building out systems, to automate and streamline the tedium away.

                    I do sincerely not know of other ways to keep things in check though, in lieu of which you do get the makeshift shadow ops with all of its pitfalls. It's kind of just life.

          • silon42 12 hours ago |
            Doesn't this happen anyway with "final2.really.final" ?
      • croon a day ago |
        This might catch flak, but generalizing, I would assume that the people banning things are the same people who would use Excel for something where a database would be better; and if so, that is the reason Excel isn't banned under the same conditions that would get SQLite banned.
      • mr_toad a day ago |
        I’ve worked at some organisations that have strict rules (not always strictly followed) about what can go in Excel spreadsheets, and where they have to be stored. The C drive is verboten. Some also have standards about classification and labelling of PII and sensitive data.
      • haspok a day ago |
        You can enforce classification and privacy labels (or something similar) in Excel and other document files, at least in a closed corporate environment. Azure also supports this. Also, since everyone has Office installed (in a corporate environment), anyone can open and work with an Excel file.
        • jnwatson a day ago |
          I don't have Office installed, nor do a significant majority of my peers. Given that sqlite is installed by default on Macs, a sqlite file is far more portable than an Excel file.
      • forinti a day ago |
        Don't get me started on Access...
        • mcdonje a day ago |
          Man, Access could've been so good if they just made an app around SQLite. Or since it's Microsoft and they need to do everything their own way, it would've been so good if they made a flat file DB à la SQLite, but with T-SQL (or a subset thereof) instead of JET-SQL.

          Increase interoperability. Funnel data people from Excel into real DB technologies.

          And they could have done more to blur the lines between spreadsheets and databases: make it seamless to work out of both Excel and Access, add more spreadsheet features to the data views, etc.

      • mcdonje a day ago |
        PII sniffers are pretty good at dealing with Excel files. Excel is seen more as an analyst tool than a dev tool. Any place that bans Excel needs to either let analysts use some other Turing-complete data tools, like Python or R or something, or they'll have trouble attracting analyst talent. They'll have devs and data entry users and that's it.

        The only way that works is if the dev team is large enough to be responsive to business needs, which almost never happens because devs are expensive. The juniors who are tweaking business logic every day are functionally doing a role analysts can do if you just give them a sane API and data tools.

    • slopinthebag 2 days ago |
      > DevOps and DBA teams

      Ah so two teams nobody should listen to.

      • frollogaston 2 days ago |
        At least I'd take it with a grain of salt when the DBA wants you to depend more on the DBA.
        • slopinthebag 2 days ago |
          Same with devops tbh.

          "Hey everyone, we need to chose the option that involves us the most and provides us the most job security"

          • mschuster91 a day ago |
            Well... eventually the company learns the lesson the hard way, either because a site goes down or gets 0wned. Then everyone will cry about "how this could happen", and the ops people will tell you in response "we warned you that this would happen, here's the receipts, now GTFO".
            • shermantanktop a day ago |
              Preparing to say “I told you so” is a fairly obvious incentive for someone to act like Chicken Little. And of course sometimes they are right, but not always.

              Lots of great people in both devops and security. But when teams position themselves as the conscience of the org and the gatekeepers of production, the defensive victim mentality can get pretty strong.

    • tehlike 2 days ago |
      There are interesting uses for sqlite, like this one: https://sqlite.org/sqlar.html
    • gandutraveler a day ago |
      DevOps and DBAs must hate RAM and caches.
    • Scribbd a day ago |
      I recently watched a YT video about this subject: https://www.youtube.com/watch?v=lSVgeMoXJTs

      In summary, companies use the bus-metric to see how viable a project is. Bus, as in, how many people can be hit by a bus before there is no one left to maintain the project.

      Despite its ubiquity, SQLite is maintained by only 3 people. That bus-metric for SQLite is 3, which is way too low for some companies.

      Give the link a watch; it was really interesting.

      • BenjiWiebe a day ago |
        At least with SQLite, it is really stable so if development did cease, you'd probably be fine indefinitely.
      • redeeman 7 hours ago |
      And anyone who considers this to be the case for SQLite should probably have their reasoning skills examined.

      If the unfortunate bus incident happens to the SQLite developers, there is exactly ZERO chance that it will not be very well maintained, on account of all the users, many of whom already have support contracts going for decades and would require the same level of support they have already enjoyed.

    • giancarlostoro a day ago |
      This is why I put configs like that into AppData or dotfile directories, or the equivalent for MacOS (I forget which one it is inside of the ~/Library directory).
    • y-curious a day ago |
      Required reading for “anything can become a mission critical database” conversations:

      https://www.reddit.com/r/sysadmin/comments/eaphr8/a_dropbox_...

    • duped a day ago |
      > a file that can have any extension

      So read the magic number, you shouldn't trust file extensions anyway
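      A sketch of such a sniffer in Python (every SQLite 3 database starts with the documented 16-byte header `SQLite format 3\0`, regardless of extension; the file suffix below is invented):

```python
import os, sqlite3, tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # documented 16-byte SQLite 3 header

def looks_like_sqlite(path):
    # Sniff the file header instead of trusting the extension
    with open(path, "rb") as f:
        return f.read(16) == SQLITE_MAGIC

# Demo: a database hiding behind an arbitrary extension
fd, path = tempfile.mkstemp(suffix=".totally-not-a-db")
os.close(fd)
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (x INTEGER)")
con.commit()
con.close()

print(looks_like_sqlite(path))  # True
```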

      > that file can be copied around to other servers

      So can spreadsheets

      I'm not discounting that having centralized data access is desirable but it doesn't sound like that particular reasoning is well thought out

    • WorldMaker a day ago |
      This "shadow IT DBA" issue has always been a classic problem with Access databases, too.
    • TheRealPomax a day ago |
      Some firms don't understand how to do data management, and if we draw the Venn diagram of those and the ones that ban SQLite, it'd be pretty close to a circle.

      Yes, databases could have any extension, but no sane dev team would accept code that doesn't use an obvious extension for a SQLite database.

      Yes, databases can contain PII, but no sane product manager would go "yes, that's a good use of SQLite".

      Yes, you can trivially copy database files, but no sane product needs to, in the same way that no sane product should require folks to clone the db just to do some work.

      Pretty much every reason a company has for banning sqlite is a red flag for working there.

    • darthwalsh 16 hours ago |
      A production app with customer data needs a data backup/restore strategy. I'm guessing a random app server writing to a local sqlite file isn't doing that either.
    • moralestapia 4 hours ago |
      >DevOps and DBA teams would prefer that the database be a big, heavy iron thing that is very obviously a database server. And when you connect to it, that's also very obvious etc etc.

      If I was their CTO and they told me this, and it is not a joke, I'd fire them on the spot.

  • srcreigh 2 days ago |
    • nashashmi a day ago |
      Taking a minute to appreciate the level of long term thinking required for storing data, to plan for 300-500 years into the future, to be able to withstand all kinds of innovations, and survive basic obsolescence.

      What is the longest surviving paper medium?

      • StilesCrisis a day ago |
        Dead sea scrolls come to mind, for some values of "survive." And the Book of Kells is in good shape.
    • IshKebab a day ago |
      Seems like they're pretty lax about their recommendations tbh. XLS is "preferred".
  • akihitot 2 days ago |
    For public-sector data preservation, it may be one of the best options.

    - The specification is publicly available
    - It is widely adopted
    - It is likely to remain readable in the future
    - It has little dependency on specific operating systems or services
    - It carries low patent risk

    From the perspective of long-term continuity, avoiding dependence on any particular company or service is extremely important.

    • Spooky23 2 days ago |
      Archivists also love formats close to native. SQLite lets the relational relationships be present in a way that csv cannot.
      • akihitot 2 days ago |
        That's certainly true. The ability to define table relationships is a major difference from CSV.
      • b40d-48b2-979e a day ago |
        Foreign keys are not enforced unless you enable it but only for that connection.
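        A minimal demonstration of that per-connection behavior, using Python's stdlib (`isolation_level=None` just keeps the PRAGMA out of an implicit transaction):

```python
import sqlite3

# Autocommit mode, so the PRAGMA below isn't swallowed by an open transaction
con = sqlite3.connect(":memory:", isolation_level=None)
con.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child  (pid INTEGER REFERENCES parent(id));
""")

# Enforcement is OFF by default: this orphan row is silently accepted
con.execute("INSERT INTO child VALUES (999)")

# Must be re-enabled on every new connection
con.execute("PRAGMA foreign_keys = ON")
try:
    con.execute("INSERT INTO child VALUES (1000)")
    enforced = False
except sqlite3.IntegrityError:
    enforced = True
print(enforced)  # True
```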
  • ray_v 2 days ago |
    It's so funny, because I was JUST telling a colleague of mine - another librarian - this exact fact about sqlite!
  • rmunn 2 days ago |
    > As of this writing (2018-05-29) ...

    So this news is nearly <del>six</del> EIGHT years old. But I didn't happen to know about it until now, so that's not a complaint at all; rather, this is a thank-you for posting it.

    (Thanks for the correction. Brief brain malfunction in the math department there).

    • tehlike 2 days ago |
      Sir, it's 2026. It's 8 years old.
      • rmunn 2 days ago |
        Corrected; thanks.
      • harrouet a day ago |
        Not if the GP was written 2 years ago :)
    • frollogaston 2 days ago |
      Was going to say, was having deja vu reading this
    • mcfedr a day ago |
      makes sense really, nothing this sensible is gonna happen under the current US administration
  • tombert 2 days ago |
    On a recent project I have needed to use exFAT. exFAT is terrible for a number of reasons, but in my case the thing I had to deal with was the lack of journaling, which had the possibility to corrupt files if there were a power interruption or something.

    I initially was writing a series of files and doing some quasi-append-only things with new files and compacting the old one to sort of reinvent journaling. What I did more or less worked but it was very ad hoc and bad and was probably hiding a lot of bugs I would eventually have to fix later.

    And then I remembered SQLite. I realized that ACID was probably safe enough for my needs, and then all the hard parts I was reinventing were probably faster and less likely to break if I used something thoroughly audited and tested, so I reworked everything I was doing to SQLite and it worked fine.

    I wish exFAT would die in a fire and a journaling filesystem would replace it as the "one filesystem you can use everywhere", but until it does I'm grateful SQLite exists.

    • mmooss 2 days ago |
      > I wish exFAT would die in a fire and a journaling filesystem would replace it as the "one filesystem you can use everywhere"

      Where exactly is everywhere? Win32? All of Linux? BSDs? macOS? iOS? ...

      • ghrl a day ago |
        Something MacOS and Windows support natively would be a good start, it could grow from there.
        • Ringz a day ago |
          Looking at *all* my external drives now... that would be great.
      • tombert a day ago |
        Everywhere exFAT is supported now. Windows, Mac, Linux, FreeBSD would be fine.
        • pbhjpbhj a day ago |
          Presumably Microsoft fear making it easy to swap OSes and access the same data.

          "I can use Linux because if I get stuck I can just switch to Windows and still access my data" is a comfort that probably keeps people from even trying Linux (or other OSes)?

          Why else would MS not support BTRFS/ZFS/Ext or whatever?

          {I'm not saying that I think this works.}

          • iknowstuff a day ago |
            > Why else would MS not support BTRFS/ZFS/Ext or whatever?

            You seriously can’t think of another reason? File systems are complex. Maintenance is a huge burden. Getting them wrong is a liability. Reason enough to only support the bare minimum. And then, 99% of their users don’t care about any of those. NTFS is good enough

            • StilesCrisis a day ago |
              NTFS is dog slow. Unfortunately it's nowhere near good enough.
              • tombert a day ago |
                In my mind, in the year 2026, I don't really see the point in using a non-CoW filesystem; it would be nice if the Windows System Restore tool actually worked, and that could be achieved much simpler if there were filesystem-level snapshots.
              • garaetjjte 5 hours ago |
                I don't think this is necessarily fault of NTFS, but Windows filesystem stack.
          • jodrellblank a day ago |
            Have you seen Linus Torvalds' comments on ZFS from 2020?

            https://www.realworldtech.com/forum/?threadid=189711&curpost...

            ".. there is no way I can merge any of the ZFS efforts until I get an official letter from Oracle that is signed by their main legal counsel or preferably by Larry Ellison himself .. Don't use ZFS. It's that simple. It was always more of a buzzword than anything else, I feel, and the licensing issues just make it a non-starter for me. .. The benchmarks I've seen do not make ZFS look all that great. And as far as I can tell, it has no real maintenance behind it either any more, so from a long-term stability standpoint, why would you ever want to use it in the first place?"

            BTRFS: RedHat has removed all support for BTRFS and deprecated it: https://access.redhat.com/solutions/197643

            BTRFS, "Linux's perpetually half-finished filesystem" by ArsTechnica: https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu... with many problems still unaddressed in 2021 dating back to 2009.

            • tombert a day ago |
              I definitely understand why he doesn't want to merge it in and risk potential litigation from Oracle, but I think he's kind of wrong about the rest of what he says.

              I don't know what people on Solaris use, but I'm pretty sure everyone in the Linux and BSD community is running OpenZFS, which does get frequent updates and has been pretty stable as a kernel module for quite a while. My main server in my house is running a RAID-Z2 on Linux and has been for more than six years, and I haven't really had any issues. I run scrubs regularly and things seem to work just fine.

              I do wish that Oracle would give written permission to let Linux include it into the kernel, since I think it would make it easier to run ZFS on root (which I don't bother with, I just use btrfs on root and that's fine for single-drive systems, like a laptop).

      • noirscape a day ago |
        Everywhere in the sense of "I have a USB stick/SD card, what do I format it to so that every major device I'm using can read it".

        In practice, every OS has its preferred system and the rest has varying levels of "I guess this works", with FAT32 and exFAT being the only real cross-platform options.

        To wit:

        * NTFS is only really properly and fully supported on Windows. Apple mounts it read-only. Linux can certainly mount NTFS and do some basic reads and writes. Unfortunately, for whatever reason, the Linux fsck tools for NTFS are absolutely terrible, poorly designed, and generally can't fix even the most basic of issues. At the same time, mount refuses to work with a partially corrupted filesystem, so if you're dealing with dirty unmounts (where the worst case is usually some unclosed file handle rather than data loss, but this also happens if you try to mount a suspended Windows partition, which isn't uncommon since Windows hibernates by default and calls it fast boot), that's a boot to Windows just to fix it.

        * Apple filesystems basically only work on apple devices. It's technically possible to mount them on Linux, but you end up digging into the guts of a bunch of stuff that Apple usually just masks for you.

        * ext4 is only properly read/write under Linux and requires external drivers under Windows (which may not work properly either, as corruption issues are common).

        FAT32 is reliable in that any OS can fsck/chkdsk it and properly mount it without needing special drivers, but is hindered by ancient filesize limitations. exFAT, at least for most cases, is the only filesystem you can plug into most devices and expect more or less the same capabilities as FAT32 (read/write support, can fix filesystem corruption.)

        Out of the os specific ones, NTFS seems like it has the most potential to be the one filesystem that works everywhere; it's modern, works good-ish on most devices, it's just that the fsck/chkdsk tooling is awful outside of Windows.

    • topham 2 days ago |
      The problem with it is you didn't solve your biggest actual problem, you just haven't had a problem bite you in the ass yet so you think your problem is solved.
      • tombert a day ago |
        I am not sure the problem is actually fully solvable. I think SQLite helps at least a little.
        • IshKebab a day ago |
          It's totally solvable, and SQLite solves it (or claims to, anyway). The real question is whether it works. To test this sort of thing properly you really need what is now called DST (deterministic simulation testing), and I'm not sure SQLite does that. It is pretty well tested, though, so they've probably done at least some testing of it.
          • tombert a day ago |
            I guess some context; I'm not 100% sure it's solvable for the actual domain I'm working on, which is Micro SD cards; they have a tendency to lie about write success.

            I think that is at too low of a level for me to realistically solve, but with SQLite it will at least do what little I can; the fact that it's been around for twenty years with extremely thorough testing and frequent updates means that it's more likely to be correct than some ad-hoc thing I come up with. I think I'm pretty clever sometimes, and I could probably get something as good as SQLite if I really wanted to, but I don't think I'd surpass it, and at that point why not just use SQLite?

            • IshKebab a day ago |
              > they have a tendency to lie about write success

              As long as they lie in order, or alternatively you have a way of verifying the write (e.g. by reading it back) then you should be able to make it work fairly easily.

              If they just completely lie - the data is just cached but never actually written - then you're screwed. There's obviously no way to make a persistent storage device out of something that doesn't persist your data.

              • tombert a day ago |
                In my experience it's the latter as far as I can tell. It has actually written like 99.99% of the time, but about 1/10000 writes it actually isn't writing.

                exFAT has the lovely feature of potentially not only corrupting the file, but also corrupting the metadata for the surrounding system as well. It's terrible.

  • faangguyindia 2 days ago |
    I went from thinking "SQLite is a toy product, not reliable for real data" to "let's use SQLite for almost everything".

    SQLite is very good if you can fit into the single writer, multiple readers pattern; you'll never lose data if you use the correct settings, which takes a minute of Google search to figure out.

    Today, most of my apps are simply go binary + SQLite + systemd service file.

    I've yet to lose data. Performance is great and plenty for most apps.

    • michaelchisari a day ago |
      The single writer is less of an issue in practice than it's made out to be. Modern nvme drives are incredible and it's trivial to get 5k writes per second in an optimized WAL setup. Way more than most apps could ever dream.

      And even then, I've used a batch writer pattern to get 180k writes per second on a commodity vps.
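      A minimal sketch of that batch-writer idea (the table name and batch size are invented): queue rows and pay the commit/fsync cost once per batch rather than once per row.

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

pending = [("event-%d" % i,) for i in range(10_000)]  # rows queued by writers

# One transaction per batch: a single commit covers all 10k inserts,
# instead of 10k separate commits each forcing a sync to disk.
con.execute("BEGIN")
con.executemany("INSERT INTO events (payload) VALUES (?)", pending)
con.execute("COMMIT")

count = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 10000
```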

      • Ringz a day ago |
        I usually try to explain it like this: “Single writer” is rarely a real problem, because a writer is not slow. It writes exclusively, but very quickly.

        "Batch writer pattern" is a good idea to get rid of expensive commits.

      • 0123456789ABCDE a day ago |
        all* of that + sharding -> https://sqlite.org/lang_attach.html

        ex: main.db + fts.db. reading and writing to main.db is always available; updating the fts index can be done without blocking the main database — it only needs to read, the reads can be chunked, and delayed. fts.db keeps the index + a cursor table — an id or last change ts

        could also use a shard to handle tables for metrics, or simply move old data out of main.db

        * some examples:

          conn = sqlite3.connect("data.db")
          conn.execute("PRAGMA journal_mode=WAL")        # concurrent reads (see above)
          conn.execute("PRAGMA synchronous=NORMAL")      # fsync at checkpoint, not every commit
          conn.execute("PRAGMA cache_size=-62500")       # ~61 MB page cache (negative = KB)
          conn.execute("PRAGMA temp_store=MEMORY")       # temp tables and indexes in RAM
          conn.execute("PRAGMA busy_timeout=5000")       # wait 5s on lock instead of failing
        
        edit: orms will obliterate your performance — use raw queries instead. just make sure to run static analysis on your code base to catch sqli bugs.

        my replies are being ratelimited, so let me add this

        the heavy-duty server other databases have is doing the load-bearing work that folks complain sqlite can't do.

        the real DBMSs are doing mostly the same work that sqlite does, you just don't have to think about it once they're set up. behind that chunky server process, the database is still dealing with writing your data to a filesystem, handling transaction locks, etc.

        by default sqlite gives you a stable database file, where once you see the transaction complete, the changes have been committed to storage and cannot be lost if the machine were to crash right after that.

        you can decide to waive some or all of those guarantees in exchange for performance, and this doesn't even have to be an all-or-nothing situation.
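        the main.db + fts.db split above is just ATTACH under the hood; a sketch (table names invented, throwaway temp files standing in for the real paths):

```python
import os, sqlite3, tempfile

tmp = tempfile.mkdtemp()
main_path = os.path.join(tmp, "main.db")
fts_path = os.path.join(tmp, "fts.db")

con = sqlite3.connect(main_path, isolation_level=None)
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO docs VALUES (1, 'hello sqlite')")

# Attach a second database file under the schema name 'fts';
# its tables are addressed as fts.<table> on the same connection
con.execute("ATTACH DATABASE ? AS fts", (fts_path,))
con.execute("CREATE TABLE fts.terms (doc_id INTEGER, term TEXT)")
con.execute("INSERT INTO fts.terms VALUES (1, 'hello')")

# Cross-database join between the two files
row = con.execute("""
    SELECT d.body FROM docs d JOIN fts.terms t ON t.doc_id = d.id
    WHERE t.term = 'hello'
""").fetchone()
print(row[0])  # hello sqlite
```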

        • hparadiz a day ago |
          Oh fun something I have some metrics on. I just made this benchmark for every php orm a few weeks ago for fun.

          https://the-php-bench.technex.us/

          There's a huge performance difference between memory and file storage within sqlite itself. Not even getting into tuning specifics.

    • ashellunts a day ago |
      Do you use multiple backend nodes? If yes, how do you access sqlite files from different nodes?
      • faangguyindia a day ago |
        I use it for apps which don't need multiple backend nodes.

        When i actually have something that requires multi nodes, i just use postgres (with replica) or mongo (with replica).

        But it's for those apps which are in autoscaler.

        For bulk data refresh I use a build artifact and hot-reload memory-mapped files, by checking a manifest on object storage and only fetching the update if it's newer.

        I've used this pattern everywhere and never really needed anything more, occasionally i might use redis if something required shared state across multiple nodes and fast.

      • xenadu02 16 hours ago |
        You can always route writes to a writer node with streaming WAL replication to all the reader nodes. Works for some workloads and systems, not for others.

        For that matter if you write your system with the correct abstraction you can switch to Postgres _later_ if it becomes necessary. For every system that really did need to scale 10,000 are pointlessly overbuilt - worrying about scale when it just didn't matter.

    • beoberha a day ago |
      For me, the concern about SQLite has never been if the database engine itself is “reliable for real data”, but that storing data on a single node is not “reliable for real data”. Performance aside, what you are positing is no different than dumping everything to a text file on disk. What happens if that VM dies?
      • aperrien a day ago |
        If the file is that important, it shouldn't be stored in the VM, but on some sort of more robust storage system.
        • ohnei a day ago |
          The standard for pretty much any multiuser app of a reasonable size is a quorum of SQL or NoSQL DBs, preferably as a single source of truth for all retainable state. Personally, I think FoundationDB is the closest attempt at a minimal viable base layer that I've encountered. But being C/C++ based and now owned by Apple makes it not suitable for the role.
      • hellcow a day ago |
        It's trivial to set up WAL-based streaming backups. Same thing as you'd have with Postgres on GCP. Restore from your latest backup.
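        For WAL shipping specifically that usually means a tool like Litestream, but even Python's stdlib exposes SQLite's online backup API for consistent snapshots (a sketch; the in-memory connections stand in for a live app.db and a backup.db):

```python
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for the live app database
src.execute("CREATE TABLE t (x INTEGER)")
src.execute("INSERT INTO t VALUES (42)")
src.commit()

dst = sqlite3.connect(":memory:")  # stands in for the backup target
src.backup(dst)  # consistent snapshot, safe even while writers are active

print(dst.execute("SELECT x FROM t").fetchone()[0])  # 42
```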
      • ramblurr a day ago |
        Postgres, MySQL, all of them, they write to files (binary, not text) on your disk. What happens if your postgres VM dies?

        (Hint: whatever your answer is it'll apply to SQLite too)

  • afshinmeh 2 days ago |
    I love SQLite and thanks for sharing it but there should be a "(2018)" at the end in the title:

    > As of this writing (2018-05-29) the only other recommended storage formats for datasets are XML, JSON, and CSV.

    • maxloh 2 days ago |
      FYI, they added a lot more formats to the list after that.

        Preferred
        
        1. Platform-independent, character-based formats are preferred over native or binary formats as long as data is complete, and retains full detail and precision. Preferred formats include well-developed, widely adopted, de facto marketplace standards, e.g.
          a. Formats using well known schemas with public validation tool available
          b. Line-oriented, e.g. TSV, CSV, fixed-width
          c. Platform-independent open formats, e.g. .db, .db3, .sqlite, .sqlite3
        
        2. Any proprietary format that is a de facto standard for a profession or supported by multiple tools (e.g. Excel .xls or .xlsx, Shapefile)
        
        3. Character Encoding, in descending order of preference:
          a. UTF-8, UTF-16 (with BOM),
          b. US-ASCII or ISO 8859-1
          c. Other named encoding
        
        ---
        
        Acceptable
        
        For data (in order of preference):
        
        1. Non-proprietary, publicly documented formats endorsed as standards by a professional community or government agency, e.g. CDF, HDF
        2. Text-based data formats with available schema
        
        For aggregation or transfer:
        
        1. ZIP, RAR, tar, 7z with no encryption, password or other protection mechanisms.
      
      https://www.loc.gov/preservation/resources/rfs/data.html
      • xxs a day ago |
        .7z being there just discredits the entire process. The container allows a free-hand choice of compression algorithm, which can be anything[0], or contain bugs and exploits[1]. Personally I use only zstd with .7z, which is 'non-standard' relative to the official (Russian) release.

        [0]: https://7-zip.org/7z.html

        [1]: CVE-2025-0411

        • tnelsond4 a day ago |
          I love using zstd, it's so fast to decompress. I especially like that the JavaScript decoder is 8kb and still really fast. Though the 25kb wasm decoders are about twice as fast.

          What are the advantages or reasons to use zstd in a 7z container versus just .zst?

          • xxs a day ago |
            I love zstd as much as the next guy, and I do use zstd solo for the most part. I gave a talk on it a few years back too (incl. using the lib directly from Java, massively decreasing log storage, and so on).

            Why use it with 7-zip, though? 7-zip archives multiple files/directories and supports encryption. It has a UI too. On Windows there is NanaZip, available in the Microsoft Store and graced by corporate for user-install (unlike zstd, which effectively needs WSL), and most folks won't be able to use the command line tool.

            Of course using tar with zstd is always an option if you are on linux.

  • tnelsond4 a day ago |
    I'm always inspired by SQLite. Overall I like it, but if you're not doing writes it's really overkill.

    So I made a format that will never surpass SQLite, except that it's far lighter and faster and works on zstd-compressed files. It has really small indexes and can contain binaries or text just like SQLite.

    The wasm part that decompresses, reads, and searches the databases is only 38kb uncompressed (maybe 16kb gzipped). Compared to SQLite's 1.2mb of wasm and glue code, that's 3% the size, and searching and loading is much faster. My program isn't really column-based and isn't suitable for managing spreadsheets, but it's great for dictionaries and file archives of images and audio.

    I ported the jbig2 decoder as a 17kb wasm module, so I can load monochrome scans that are 8kb per page and still legible.

    https://github.com/tnelsond/peakslab

    SQLite is very well engineered, PeakSlab is very simple.

    • zoky a day ago |
      something something XKCD competing standards something something
      • tnelsond4 a day ago |
        Believe me, I tried sticking to SQLite or aard2 or stardict, they just were fundamentally inadequate with no good pwa cross platform tooling.
        • bbkane a day ago |
          Does this remain true now that SQLite has a WASM build?
          • tnelsond4 a day ago |
            Yes, because originally when I started PeakSlab it used the SQLite wasm build.
      • lpln3452 a day ago |
        Creating something new for a different use case isn't pointless. It's like comparing inline skates to ice skates.
      • keybored a day ago |
        Doesn’t even apply unless someone says that (1) there are too many “standards”, and (2) so we are making this standard (neither apply here). Someone made something.

        We should really consider eventually retiring memes because they just end up as thought-terminating cliches.

        This is of course referring to xkcd #927. How do I know that?

    • giza182 a day ago |
      Perhaps a dumb question, but how do you get data into it if you’re not doing writes
      • tnelsond4 a day ago |
        Generate it one time from a source tsv file or folder of media.
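        (The one-time-generation step can be sketched with Python's sqlite3; the TSV columns and table name here are made up for illustration.)

```python
import csv, io, sqlite3

# A tiny TSV source (hypothetical dictionary data); normally a file.
tsv = "headword\tdefinition\napple\ta fruit\nzebra\tan animal\n"

con = sqlite3.connect(":memory:")  # normally a .db file shipped read-only
con.execute("CREATE TABLE entries (headword TEXT PRIMARY KEY, definition TEXT)")

# DictReader yields one dict per row, which maps onto named parameters.
reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
con.executemany("INSERT INTO entries VALUES (:headword, :definition)", reader)
con.commit()

row = con.execute(
    "SELECT definition FROM entries WHERE headword = ?", ("apple",)
).fetchone()
print(row[0])  # a fruit
```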
      • andrelaszlo a day ago |
        I think it's just immutable once you've generated it. No need to update indexes or check consistency on writes, no need for transactions, etc.
      • pfortuny a day ago |
        Think historical records of, say, share values for past years. You might have a single db for 1900-2000, for instance. Things like that.

        Not everything needs to be real-time updated.

      • shermantanktop a day ago |
        It’s an RODB. Ship the preindexed data blob.
      • electroly a day ago |
        I have a system that builds SQLite databases and uploads them to S3. Once they're in S3, they are never changed. The program that builds the databases only does writes, and the program that queries the databases only does reads. It uses a VFS to query the database in-place with HTTP range requests.

        This is indeed not an optimal setup. A more careful design from first principles would not require seeking around the file as much as SQLite does; we'd do a better job of reading exactly the correct range of bytes for a given query, since we know ahead of time what the access patterns are, and we could do reads in parallel. With SQLite we have to be very careful about the schema design to ensure it won't have to seek too many times to answer a query. But SQLite was expedient, and I'm confident I'll always be able to read the files. That's less certain for a custom file format.
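        (The range-read idea can be illustrated offline: SQLite's documented file format stores the page size as a big-endian 16-bit value at byte offset 16 of the 100-byte header, so a reader that fetches just the header can plan page-aligned reads. A sketch, not the actual VFS described above; `range_read` stands in for an HTTP Range request.)

```python
import os, sqlite3, struct, tempfile

# Create a small database file to stand in for the object in S3.
path = os.path.join(tempfile.mkdtemp(), "data.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")
con.execute("INSERT INTO t VALUES (1, 'x')")
con.commit()
con.close()

def range_read(path, start, length):
    """Stand-in for an HTTP GET with a 'Range: bytes=...' header."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)

# The 100-byte header alone tells us how to align later reads:
# bytes 16-17 hold the page size as a big-endian u16.
header = range_read(path, 0, 100)
(page_size,) = struct.unpack(">H", header[16:18])
print(page_size)
```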

        • giovannibonetti a day ago |
          If it's going to be read-only, why not make it a Parquet file instead? It should result in a smaller file size due to columnar compression.

          DuckDB has built-in capability to read Parquet files with HTTP range requests.

          • electroly a day ago |
            For this use case we need the ability to do an indexed query and extract a small number of rows from a large database. It's a traditional row-oriented database workload. I'm sure other solutions would also work, but SQLite's design melds well with the data. The migration from partitioned SQL Server tables to a collection of SQLite database files was straightforward.
        • 1718627440 4 hours ago |
          If it's that small, why not compile it directly into the application? The compiler might even be able to optimize a good chunk of data away, or optimize the code based on the data.
          • electroly 4 hours ago |
            Nothing small about it. This is terabytes of data and new files are added daily with new data. That's why we had to move it into S3. It was costing too much to store in SQL Server.

            I only reach for SQLite when it's big. Otherwise, I'd prefer to just use the existing SQL Server infrastructure, or .csv.gz files in S3. Internally, I use the term "bulk data" for data sets that are stored in S3 because they were too big for SQL Server.

    • pjc50 a day ago |
      I think actually this competes with the old BerkeleyDB: https://en.wikipedia.org/wiki/Berkeley_DB - which I now see is no longer BSD-licensed, and in any case has been rendered almost extinct by SQLite. It was used for basic on-disk key-value store work.
      • tnelsond4 a day ago |
        Even BerkeleyDB tries to be mutable. What I'm doing doesn't need the mutability so it's much more similar to dictionary formats (though probably simpler) than it is to a database. Though a lot of people do use full databases for immutable dictionary key-value stuff. I just couldn't get any database to work well enough for a pwa dictionary.
      • tingletech a day ago |
        I don't think it has had a BSD license this century, Sleepy Cat was selling licenses in the 90s before Oracle bought them.
      • nostrademons a day ago |
        It seems more like SSTables, which are widely used by open-source software like LevelDB, HBase, and Cassandra (and Google's BigTable) but AFAIK don't have a standard open-source reader (unless you want to pull the relevant source file out of Cassandra or LevelDB).

        https://www.igvita.com/2012/02/06/sstable-and-log-structured...

        • tnelsond4 a day ago |
          Yes this does seem almost exactly like what PeakSlab is doing. I'll have to investigate more how SSTables and memtables work to see if there's any improvements or anything to glean from them.

          I'm storing null delimited sorted strings with an index that's an offset to where that entry begins, so if I wanted to do memtables for writing I would have to have two memtables and merge them on write? I've been considering this exact approach.
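          (My reading of the layout above, as a sketch: sorted, null-delimited strings plus an offset index, searched by binary search. Names and structure are my own illustration, not PeakSlab's actual code.)

```python
def build(words):
    """Pack sorted, null-delimited strings and record each entry's offset."""
    data = bytearray()
    offsets = []
    for w in sorted(words):
        offsets.append(len(data))
        data += w.encode("utf-8") + b"\x00"
    return bytes(data), offsets

def entry_at(data, offsets, i):
    """Decode the entry starting at offsets[i], up to its null terminator."""
    start = offsets[i]
    end = data.index(b"\x00", start)
    return data[start:end].decode("utf-8")

def lookup(data, offsets, key):
    """Binary search over the offset index; returns entry index or -1."""
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        e = entry_at(data, offsets, mid)
        if e < key:
            lo = mid + 1
        elif e > key:
            hi = mid
        else:
            return mid
    return -1

data, offsets = build(["pear", "apple", "zebra"])
print(lookup(data, offsets, "pear"))   # 1 (sorted position)
print(lookup(data, offsets, "mango"))  # -1 (absent)
```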

    • meindnoch a day ago |
      It is crashing Safari.
      • tnelsond4 a day ago |
        Works on my wife's old iphone. I don't have a mac to test things on.
    • gpvos a day ago |
      A more standard solution would be cdb.[0] Although that doesn't support compressed data.

      [0] https://cdb.cr.yp.to/ , https://en.wikipedia.org/wiki/Cdb_(software)

      • tnelsond4 a day ago |
        I'll definitely have to look at this. I'm using binary search right now because it's fast enough and I understand it better, but I might replace it with hashes at some point.
    • sgbeal a day ago |
      > Compare that to SQLite's 1.2mb of wasm and glue code

      The current trunk is actually 1.7mb in its canonical unminified form (which includes very nearly as much docs as JS code), split almost evenly between the WASM and JS pieces :/. Edit: it is 1.2mb in minified form, though.

      Disclosure: i'm its maintainer.

      Edit: current trunk, for the sake of trivia:

          sqlite3.wasm 896745
          sqlite3.mjs  816270 # unminified w/ docs
          sqlite3.mjs  431388 # unminified w/o docs
          sqlite3.mjs  310975 # minified
    • smartmic a day ago |
      Many comments here on your creation, PeakSlab, but no dedicated praise yet. I didn't know it, but I have to say it is really cool and innovative! The performance of the dictionary is indeed superb and I will definitely bookmark this for future reuse. So, in a nutshell: thanks for sharing!
    • raxxorraxor a day ago |
      SQLite is simple in its own way and I like the design principle of their SQL dialect.

      "Right joins are just left joins in the wrong direction, you don't need that crap"

      Of course it always gets simpler or more specialised. I think many apps using databases would run with SQLite just as well. And some would probably run just as well with a textfile instead of any db like SQLite.
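      (The quoted stance is easy to see in practice: on any SQLite version, a RIGHT JOIN is just a LEFT JOIN with the operands swapped. A sketch with made-up tables:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO books VALUES ('B1', 1), ('Orphan', 99);
""")

# "books RIGHT JOIN authors" is "authors LEFT JOIN books":
# every author appears, paired with matching books or NULL.
rows = con.execute("""
    SELECT a.name, b.title
    FROM authors a LEFT JOIN books b ON b.author_id = a.id
    ORDER BY a.name
""").fetchall()
print(rows)  # [('Ann', 'B1'), ('Bob', None)]
```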

      • luckystarr a day ago |
        For the love of god, don't do plain textfiles anymore. In the end you have software with 20 (or more) individual files, one per program section, which works fine until you need the files to be consistent. Boom. Then you add a lock to fix it and suddenly your whole program can only run sequentially. And then your customers ask why ingress is so slow. I won't name any names here, but this is a real commercial product.
        • BenjiWiebe a day ago |
          We use a cheap invoicing program. It works fine except it gets very slow when dealing with large numbers of invoices. Turns out each invoice (or payment record, or customer record, or whatever) is a separate text file with form-urlencoded data. No indices.
      • chungy a day ago |
        > "Right joins are just left joins in the wrong direction, you don't need that crap"

        SQLite has supported all types of joins since version 3.39 in 2022.

        • raxxorraxor a day ago |
          Well, look at that, now it is downhill from here!
        • tnelsond4 a day ago |
          I must've messed something up, but I remember some joins (was it full outer join?) being unbelievably slow? Was I doing something wrong?
          • chungy a day ago |
            Too vague a question to give you an answer you'll likely find satisfactory :)

            You probably just needed to create indexes over your data to speed things up.
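            (The effect of an index is visible in EXPLAIN QUERY PLAN; a sketch with a made-up table, showing the plan flip from a full scan to an index search:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, v TEXT)")

def plan(sql):
    # The last column of an EXPLAIN QUERY PLAN row is the readable detail.
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

query = "SELECT v FROM t WHERE k = 42"
before = plan(query)  # full table scan: 'SCAN ...'
con.execute("CREATE INDEX idx_t_k ON t(k)")
after = plan(query)   # 'SEARCH ... USING INDEX idx_t_k ...'
print(before)
print(after)
```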

    • tmountain a day ago |
      Overkill in what way exactly? The LOC of the project shouldn't have any bearing on most people's usage of the project. SQLite is one of the most well-tested and mature projects in the world. What exactly would motivate someone to use PeakSlab instead? What problem are you solving?
      • IshKebab a day ago |
        Read the comment. He's using it in WASM form and doesn't want users to have to download 1.2MB of SQLite every time they visit the page.
        • shermantanktop a day ago |
          Client caches are a thing, so this is most relevant for cold-start customers. In that case PeakSlab’s download size is an advantage.

          Fwiw, LocalStorage is a SQLite db on most browsers, with a KV API. It'd be interesting to have the actual API available.

          • tnelsond4 a day ago |
            Even on warm start PeakSlab is twice as fast. It's not just download size, it's execution speed, zero copy, database decompression, etc.

            That's why PeakSlab is written in c, because what's faster than casting the whole database to a struct? ;-P

          • tekne a day ago |
            I think web sqlite was originally an (experimental) thing
      • tnelsond4 a day ago |
        I'm solving a simpler problem. Just making cross platform dictionary progressive web apps with indexes and full text search and HTML tags and uppercase letters inserted back into the text on render so they don't interfere with search.

        SQLite is 1.2mb in combined wasm and JavaScript and not really designed for my use case, so I would have to add all the things I need anyway, like compression and HTML tag insertion. For my use case, which is just PWAs, SQLite takes too long to load, the files are too big, and the search isn't tailored. So I made something else in 38kb instead.

        • tmountain a day ago |
          Got it, thanks.
    • TheRealPomax a day ago |
      If you're not modifying data, whatever system is using the data doesn't need a database at all, it just needs a data export.
    • kev009 a day ago |
      Even if you aren't doing writes, SQLite should be the default option as a file-based data format or even durable IPC. And it isn't going to fall away (unlike BerkeleyDB or Tokyo Cabinet). I recently dug into WAL mode and it is pretty incredible for multiple readers https://github.com/InterNetNews/inn/pull/338.
    • themafia a day ago |
      > but if you're not doing writes it's really overkill.

      SQLite is not just a B-Tree+ file format manager. It's also a powerful CLI with tons of built in data manipulation and extraction tools which can all be scripted.

      Finally, if your data is _relational_, then a format that preserves those relations, even if they aren't intended to be updated, is exceptionally powerful.

      > This is a custom format built to be very fast to load (cast to a c-struct and done) and very fast to search with 3 binary search indexes built in.

      What is the endianess of your format?

      • tnelsond4 a day ago |
        Little endian. x86 and wasm have the same endianness so it's very straightforward. If I wanted to enable support for reading and writing on big endian systems I could patch a few lines very easily, but that seems to be an unnecessary consideration for most uses.

        Instead of doing multiple columns, there's some of that with secondary and tertiary indexes, but mostly it's just key value.

        To make the data more relational I find it's more natural to just have additional "columns" in separate peak files and when the peak files are queried they can be combined by headword. This is kinda like what parquet does putting all of one column together.

        I could make PeakSlab have columns and have it query based off specific columns fairly easily, but it was a design decision to make things more tag based and less column based. I like that it's more flexible and less rigid in structure.

        SQLite keeps everything relational and their raw filesizes are 50% larger than a .peak file, not sure what's exactly to blame, but I'm optimizing for small filesizes.

    • tracker1 18 hours ago |
      I've had to deal with image scans and my own approach lately has been to quantize to 16-color (grayscale via pngquant) and oxipng with zopfli set to max. Mostly so that the output can be easily used in a browser UI.

      Will definitely have to look into your solution with JBIG2 next time. Though I may keep with png just for compatibility sake.

      • tnelsond4 16 hours ago |
        Yeah, I had Claude generate me the jbig2 to 1-bit png wasm module from the jbig2 source so that there'd be no dependencies on zlib or libpng. Then I aggressively removed functions from 97kb down to 17kb. So if it's missing functions you need you may need to reset the files to the .bak and recompile.
  • testermelon a day ago |
    I'm surprised they included proprietary formats that are de facto standards in a profession or supported by multiple tools (.xls, .xlsx) in the preferred section [1]. I wonder if "well-known enough" is as good as "open" from a preservation standpoint.

    [1] https://www.loc.gov/preservation/resources/rfs/data.html

    • pletnes a day ago |
      You can unzip the xlsx and read the xml inside. It’s not the worst format by far.
      • perching_aix a day ago |
        What would you reckon is the worst format? I'm very curious about your standards, given this.
        • pletnes 5 hours ago |
          A binary format that is only readable by some very specific version of the program writing it. The older xls comes to mind, but there must be thousands of examples.
    • mort96 a day ago |
      Especially when Office 365 shows that not even Microsoft is capable of making software that can display Office files anymore. If you have a Word file that was created or has ever been modified by the Word application, working with it through Office 365 in a browser is such a pain. I've literally had images that were impossible to delete or move in the web version, and they absolutely render in the wrong place.
    • acdha a day ago |
      Archivists and librarians have to think in terms of practicality: if many tools exist to read something and it's a mainstream software product, the odds are good that they'll still be able to use those files 50 years from now. Not certain, but good, and that matters with a limited budget and limited ability to tell the rest of the world what format to provide things in.

      This can require nuance: for example, PDF has profiles because the core format is widely supported but you could do things like embed plugin content from now-defunct vendors and they would only want the former for long-term preservation.

  • guelo a day ago |
    I get annoyed at all the other DBs that require their own heavy duty server process when for 90% of my projects there is only one client, my app server. Is there a DB that combines sqlite's embedded simplicity with higher concurrent write throughput?
    • TeriyakiBomb a day ago |
      I think the concurrent write thing is not as much of an issue nowadays with the speed of NVMEs and WAL.
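      (WAL mode is set per database file with a pragma, persists across reopens, and lets readers proceed while a writer commits. A minimal sketch; the filename is illustrative:)

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")
con = sqlite3.connect(path)

# WAL is a property of the database file, not the connection.
# (In-memory databases can't use WAL, hence the temp file here.)
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # wal

# A second connection to the same file sees the same mode.
con2 = sqlite3.connect(path)
mode2 = con2.execute("PRAGMA journal_mode").fetchone()[0]
print(mode2)  # wal
```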
    • graemep a day ago |
      Firebird, maybe?
  • llagerlof a day ago |
    I used SQLite for a few applications several years ago. One time, the database got corrupted and all the data was lost. That was the day I stopped using SQLite.

    Also, the lack of enforced column data types was always a negative for me.

    • jjice a day ago |
      No matter the medium, backups are a must.
      • llagerlof a day ago |
        A hard lesson learned...
    • justin66 a day ago |
      > the database got corrupted

      What caused that?

      • llagerlof a day ago |
        I don't know why that happened, but one fine day I tried to open the file using the vanilla SQLite client, and it didn't open.
    • benhurmarcel a day ago |
      For column types there are STRICT tables now
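      (A sketch of a STRICT table, guarded on the library version since STRICT requires SQLite 3.37+; the table is illustrative:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
if sqlite3.sqlite_version_info >= (3, 37):
    con.execute("CREATE TABLE t (id INTEGER, name TEXT) STRICT")
    con.execute("INSERT INTO t VALUES (1, 'ok')")  # matches declared types
    try:
        # TEXT that can't be losslessly converted to INTEGER is rejected.
        con.execute("INSERT INTO t VALUES ('oops', 'x')")
        outcome = "accepted"
    except sqlite3.IntegrityError:
        outcome = "rejected"
else:
    outcome = "strict-unsupported"
print(outcome)
```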
      • llagerlof a day ago |
        Thank you!
    • Mawr a day ago |
      I used a hard drive for a few applications several years ago. One time, the drive got corrupted and all the data was lost. That was the day I stopped using hard drives.
      • b40d-48b2-979e a day ago |
        This isn't the same and you know it.
  • butterNaN a day ago |
    (US)
  • semiquaver a day ago |
    It certainly will be in the toolkits of data archeologists hundreds of years from now. Must be a weird feeling to create something so potentially long-lasting.
  • danborn26 a day ago |
    It is great to see SQLite getting this level of institutional recognition. The single file format makes archival storage incredibly straightforward compared to traditional database dumps.
  • fpj a day ago |
    I don't know much about the LoC use case, but my initial reaction to the post is to ask why they are not building a data lake with open formats. I'm sure there are reasons for discarding open-table formats. Claude keeps telling me that the issue is that they don't address preservation properly.
  • lenwood a day ago |
    Just yesterday it occurred to me that it had been a while since I last saw an SQLite post at the top of HN.

    I really like the simplicity and speed of SQLite, I've used in both personal and professional projects. For day-to-day work I still end up in Excel, not because I like it more (I don't), but because its ubiquity makes it the lowest friction way to share & explore datasets with less technical stakeholders and execs.

    • LoganDark a day ago |
      I've always been irked by how SQLite relies on text parsing to work. Why do I have to write queries in text rather than expressing them in programmatic logic? I have never used a relational database because of this. I know they can be more performant than pure structured data, but I hate SQL and the entire idea of SQL, and I don't want to write it or learn it or use a system that relies on it. It feels like the wrong approach, on the level of PHP. Is there anything I can do about this? I don't want to keep passing up SQLite just because of SQL, but I just can't seem to make peace with it. I don't want to build strings or have string parsing anywhere in the stack; it just feels wrong.
      • weird-eye-issue a day ago |
        I bet you really love LLMs
        • LoganDark a day ago |
          I'm torn on LLMs. I've started to use them to accelerate personal development, but they still require a lot of babying and manual assistance. Still, they help a lot.
      • StilesCrisis a day ago |
        A "prepared statement" is a precompiled SQL command, ready for bindings and execution: https://sqlite.org/c3ref/stmt.html

        You can't precompile your SQL at build time, unfortunately, but you _can_ precompile all your SQL at the very start of your program and then never touch the parser again. This might be a good middle ground for you. It is infra that you can centralize, write some unit tests against, and then not worry about forever.

        It's not common because the SQLite parser is lightning fast and it's so convenient to just write out a new query as you need one, versus having one bucket of all queries. But it's an option!
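        (In Python's sqlite3 the same pattern looks like this: the SQL text is a fixed constant, values are bound as parameters rather than spliced into strings, and the module keeps an internal cache of compiled statements so repeated calls skip the parser. A sketch with a made-up table:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# One fixed SQL string; values are bound, never concatenated in.
insert = "INSERT INTO users (name) VALUES (?)"
for name in ("ada", "grace"):
    con.execute(insert, (name,))
con.commit()

rows = con.execute("SELECT name FROM users ORDER BY name").fetchall()
print(rows)  # [('ada',), ('grace',)]
```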

      • DowsingSpoon a day ago |
        If you want to avoid string manipulation then you can construct queries with a query builder API like C#'s LINQ. Other languages have similar libraries, e.g., Rust has Diesel.

        If your objection is to the SQL language itself then you might find Datalog interesting. Datalog is a logic-based language where you query by writing predicates rather than writing SQL statements. Check out Logica <https://logica.dev>. It's a language in the Datalog family that compiles to SQL.

        In both cases, SQL is used only as a low-level IR for interfacing with the database engine.

      • 1718627440 4 hours ago |
        > Why is it that I have to write queries in text rather than expressing them in programmatic logic?

        Do you only use graphical programming languages or how is your "programmatic logic" typically expressed?

    • gcardinal a day ago |
      I'm under no illusion I'll suddenly shatter your world views with this, but in case it's as useful to you as it was to me, you might want to check out Metabase[1].

      You can self-host, and if all you care about is showing data in a digestible format to stakeholders, it's really simple. You can of course go overboard and regret all of your life's decisions with it, but I try to abstain.

      [1]: https://www.metabase.com/

  • infogulch a day ago |
    SQLite is remarkably versatile. Just a couple weeks ago an extension to do cross-process queues, streams, pub/sub etc in SQLite was released:

    Show HN: Honker – Postgres NOTIFY/LISTEN Semantics for SQLite | 327 points | 94 comments | https://news.ycombinator.com/item?id=47874647

    Live notifications was one of the big missing pieces to implement whole apps on a sqlite backend, and now there's a decent solution.

  • imrozim a day ago |
    I use PostgreSQL for my startup, but every time I need quick local testing I wish it were as simple as SQLite. No config, it just works.
  • amai a day ago |
    Which version of SQLite?
  • GeorgeTirebiter a day ago |
    Now, if only the LoC would recognize the brilliance of the Fossil SCM ....
  • xiaod a day ago |
    The operational complexity is worth comparing here. The migration path and schema evolution story often matter more than raw performance numbers for teams choosing between these options.
  • tracker1 20 hours ago |
    I've used line-delimited, gzipped JSON for archive formats on several projects myself, which is a pretty good option... If I wanted more flexibility, would definitely consider SQLite.

    In fact, I've worked on several projects where I heavily advocated that even the primary app storage be SQLite, and that archival be simply copying the database after an event. Specifically: elections, petition verification, etc. It's kind of difficult coming up with complex schemas to handle multiple events as well as the state of data at those events; by separating the database itself, using SQLite, that simplifies a lot of things. Though it does practically limit scale a bit. The main thing would be to archive the application and the database after a given event. If the application is containerized, you could create an image of the source, the container, and the database after the event.

    I think this kind of structure would work well for a lot of things... especially if you're considering data sharding anyway.
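    (The line-delimited gzipped JSON approach can be sketched in a few lines; the records and filename here are illustrative:)

```python
import gzip, json, os, tempfile

records = [{"id": 1, "event": "open"}, {"id": 2, "event": "close"}]
path = os.path.join(tempfile.mkdtemp(), "archive.ndjson.gz")

# Write: one JSON object per line, gzip-compressed.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read: stream line by line without loading the whole file into memory.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```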

  • fragmede 14 hours ago |
    Yes! Can it replace CSV, please?