There's some historical cruft (especially the memory model), but picking the JVM as a target is a great decision (especially with Graal offering even more options).
Only .NET follows up on it at scale.
I thought that too before I learned Clojure, now I find them equally readable.
(tc/select-rows ds #(> (% "year") 2008))
is more, or at least as, intuitive as: filter(ds, year > 2008)
as cited above. I think there's a good argument to be made that Clojure's data processing abilities, particularly around immutable data, make a compelling case in spite of the syntax. The REPL is great too, and the JVM is fast. But I still to this day imagine infix comparisons in my head and then mentally move the comparator to the front of the list to make sure I get it right.
(filter ds (> year 2008))
That's a trivial Clojure macro to make work if it's what you find "intuitive."
I can confidently say that, among the list I mentioned, it’s the best for data manipulation/transformation. Thanks to the author for presenting it clearly and showing how the libraries and code look across different languages, all of which do a great job.
But Clojure has its own special place (maybe in my heart as well :). I think Clojure should be used more in the data science space. Thanks to the JVM, it can be very performant (I’m looking at you, Python).
(tc/select-rows ds #(> (% "year") 2008))
instead of filter(ds, year > 2008)
They seem to ignore the existence of Spark, so even if you specifically want to use the JVM, it feels clearer and simpler: ds.filter(r => r.year > 2008)
And that's on top of the vastly simpler syntax compared to what's being shown here.
The Clojure pipelining makes code much more readable. Granted dplyr has them too, but tidyverse pipes always felt like a hack on top of R (though my experience is dated here). While in Clojure I always feel like I'm playing with the fundamental language data-types/protocols. I can extend things in any way I want
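For anyone who hasn't seen the pipelining in question, here's a minimal sketch using only clojure.core. The `films` data and its column names are made up for illustration, and this uses plain maps rather than a tablecloth dataset.

```clojure
;; Hypothetical sample data: plain maps with string keys, not a real dataset.
(def films
  [{"title" "A" "year" 2007 "rating" 6.1}
   {"title" "B" "year" 2009 "rating" 7.4}
   {"title" "C" "year" 2012 "rating" 8.0}])

;; ->> threads each result in as the last argument of the next form.
(def recent-titles
  (->> films
       (filter #(> (% "year") 2008))   ; keep rows after 2008
       (sort-by #(% "rating") >)       ; highest rating first
       (mapv #(% "title"))))           ; project one column into a vector

recent-titles
;; => ["C" "B"]
```

Each step is an ordinary function over ordinary sequences, which is roughly what I mean by working with the fundamental data types rather than a pipe operator bolted on later.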
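(require '[clojure.walk :refer [postwalk]])
;; Note: this shadows clojure.core/filter in the current namespace.
(defmacro filter
  [ds & anaphoric-pred]
  (let [row-name (gensym 'row)
        ;; Rewrite every symbol that does not resolve to a var into a lookup on the row.
        pred (postwalk (fn [form]
                         (if (and (symbol? form) (nil? (resolve form)))
                           `(get ~row-name ~(str form))
                           form))
                       anaphoric-pred)]
    ;; ~ds splices in the caller's dataset form; a bare ds would be namespace-qualified by syntax-quote.
    `(tc/select-rows ~ds (fn [~row-name] ~@pred))))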
Now you can write (filter ds (> year 2008))
And it'll expand to the tc form: (pprint (macroexpand '(filter ds (> year 2008))))
=> (tc/select-rows ds (fn [row2411] (> (get row2411 "year") 2008)))
In the age of IntelliSense, auto-completion and AI-assisted coding, is the choice of an untyped scripting language still justifiable, trading safety and reliability for increased productivity?
If you're building a data system not just for exploratory work, surely modern compiled and typed systems languages like Rust and D make more sense for the safety and reliability of the end users?
Even more so with D, where you can even have scripting capability for the exploratory and prototyping stage with its REPL facility [1],[2]. This is feasible due to its very fast compile times, unlike Rust. It has a more intuitive "Pythonic" syntax compared to other typed languages [3]. You can also program with the GC, which is on by default, if you choose to. Apparently, you can have your cake and eat it too.
[1] drepl:
https://github.com/dlang-community/drepl
[2] Why I use the D programming language for scripting:
https://opensource.com/article/21/1/d-scripting
[3] All in on DLang: Why I pivoted to D for web, teaching, and graphics in 2025 and beyond! [PDF]
If you're "building a data system not just for exploratory work" then you're probably not going to be using any of the presented options. However, in my experience Clojure has an ecosystem where it is very easy to transition from exploring/playing with data at the REPL to a more robust "pro" setup that's designed to scale, handle failures, etc.
For any engineering work, including software engineering, you choose the best tool for the job. In D you have a high-performance tool capable of bit shifting, string processing and array manipulation (to name a few), covering everything from scripts to highly concurrent low-latency applications (see the presentation in ref [3] above by Prof. Shah from Yale).
It's a shame that properly typed programming languages are being ignored just because of programmers' locally sub-optimal preferences and limited exposure. The productivity gain from typical scripting languages, including Python, is diminishing every day with the proliferation of IntelliSense, auto-complete and AI-assisted coding.
For production code, systems based on scripting languages, if they ever make it to production (and most do, e.g. Airbnb, Twitter, Shopify, GitHub, etc.), will be a maintenance headache and a user nightmare unless the support is great or the company is a unicorn start-up. The last thing you want is for the saved e-claim form that you spent many hours preparing to disappear completely because the system cannot recall the saved version. Granted, this can happen for many reasons, but most of the problematic production systems are written in scripting languages, including Python, because these are the only languages the programmers know and are familiar with. Adding insult to injury, the readily available so-called "batteries included" libraries are convenient but, ironically, written in other compiled but unsafe systems languages, namely C/C++.
You can have all your tools dump out and rewrite the oodles of boilerplate your typed languages require, but at the end of the day you have to read all that junk... or not? And just vibecode and #yolo it? But then you're back to "safety and reliability" problems and you haven't won anything.
Also "safety and reliability" are just non-goals in a lot of contexts. My shitty plotting script doesn't care about "safety". It's not sitting on the network. It's reliable enough for the subset of inputs I provide it. I don't need to handle every conceivable corner case. I have other things to do
> Adding insult to injury, the readily available libraries are convenient but, ironically, written in other compiled but unsafe systems languages, namely C/C++
No one cares if you leak memory in some corner case with some esoteric inputs. And no one is worried your BLAS bindings are going to leak your secrets. These are just not objectives.
In the automotive world, if you can afford it, you need a daily-driver car for work and supermarket runs, a weekend supercar for fun/showing off, and an off-road 4x4 for overnight camping. But in the software world D can cater for almost everything, with free open-source compilers, minimal productivity overhead, and much cheaper hosting as well [1].
Funny you mentioned BLAS, since the Dlang BLAS implementation has also surpassed the run-of-the-mill high-performance BLAS libraries, something these scripting languages can only dream of (Matlab calling third-party Fortran code, no less) [2].
[1] Saving Money by Switching from PHP to D:
https://dlang.org/blog/2019/09/30/saving-money-by-switching-...
[2] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
It is safer, but it is not without its downsides. It demands a careful design to make something people will enjoy using.