Data Manipulation in Clojure Compared to R and Python

110 points by tosh 3 days ago | 36 comments

soumyaskartha 14 hours ago |
Clojure never got the data science crowd even though the language is genuinely good for it. Always felt like a distribution problem more than a technical one.
asa400 14 hours ago |
Unfortunately, having to mess around with a JVM is a tough sell for a lot of data analysis folks. I'm not saying it's rational or right, but a lot of people hear "JVM" and they go "no thank you". Personally I think it's a non-issue, but you have to meet people where they are.
famicom0 13 hours ago |
Meanwhile, I find it very annoying to deal with the litany of Python versions and the distinction between global packages and user packages, and needing to manage virtual environments just to run scripts. That being said, I am not an expert but that's always been my experience when I need to do anything Python related.
pjmlp 13 hours ago |
The irony given the mess of Python setup where there are companies whose business is to solve Python tooling.
asa400 7 hours ago |
Oh, I completely agree. Like I said, it's not rational, but it is what it is.
cmiles74 12 hours ago |
I dunno, if you can slog through the Python ecosystem then the JVM is starting to look not so bad. Plus with Clojure you don't need to deal with the headache and heartache that is Maven.
KingMob 6 hours ago |
I think that's true for only a limited subset of programs, though. The Clojure lib ecosystem is nowhere near the size of the broader Java ecosystem, so you frequently end up pulling Maven deps to plug holes anyway.
pjmlp 2 hours ago |
That is the goal of a polyglot runtime, and why Clojure was designed to be a hosted language that embraces the platform, unlike others that make their tiny island.
packetlost 12 hours ago |
idk, I don't think I've had to do anything beyond install the JVM to work with Clojure. I'm not really a fan of the clj command's flag choices though (-M, -X, etc. all make no sense)
KingMob 6 hours ago |
It's unfortunate, but people's associations with Java the lang bleed into their beliefs about the JVM, one of the most heavily-optimized VMs on the planet.
There's some historical cruft (especially the memory model), but picking the JVM as a target is a great decision (especially with Graal offering even more options).
pjmlp 2 hours ago |
Exactly, especially because there isn't THE JVM, rather a bunch of versions each with their own approaches to GC, JIT, JIT caches, ahead of time compilation.
Only .NET follows up on it at scale.
levocardia 13 hours ago |
In this very post you can see why: the dplyr code is just so much more readable. Like a lot of python, dplyr reads almost like pseudocode: take this dataset, select the columns that start with "bill", then filter so that bill_length is less than 30. So simple and so little fluff!
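For comparison, here is a rough sketch of that same "select the bill columns, then filter" pipeline in plain Python over a list of dicts. The sample rows and the helper names `select_starts_with` and `filter_rows` are made up for illustration; this is not how dplyr, pandas, or tablecloth implement it.

```python
# Hypothetical sample rows; values and column names are illustrative only.
penguins = [
    {"species": "Adelie", "bill_length": 39.1, "bill_depth": 18.7, "year": 2007},
    {"species": "Adelie", "bill_length": 29.5, "bill_depth": 17.2, "year": 2008},
    {"species": "Gentoo", "bill_length": 46.1, "bill_depth": 13.2, "year": 2009},
]


def select_starts_with(rows, prefix):
    """Return new rows keeping only the columns whose name starts with prefix."""
    return [{k: v for k, v in row.items() if k.startswith(prefix)} for row in rows]


def filter_rows(rows, pred):
    """Return a new list with only the rows for which pred(row) is truthy."""
    return [row for row in rows if pred(row)]


bills = select_starts_with(penguins, "bill")
short_bills = filter_rows(bills, lambda r: r["bill_length"] < 30)
print(short_bills)  # [{'bill_length': 29.5, 'bill_depth': 17.2}]
```

Note the predicate still has to spell out `r["bill_length"]`; the unquoted column name is exactly the part dplyr saves you.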
erichocean 13 hours ago |
> is just so much more readable
I thought that too before I learned Clojure, now I find them equally readable.
lemming 9 hours ago |
I'm very familiar with Clojure, but even I can't make a good argument that:
(tc/select-rows ds #(> (% "year") 2008))
is more, or at least as, intuitive as:
filter(ds, year > 2008)
as cited above. I think there's a good argument to be made that Clojure's data processing abilities, particularly around immutable data, make a compelling case in spite of the syntax. The REPL is great too, and the JVM is fast. But I still to this day imagine infix comparisons in my head and then mentally move the comparator to the front of the list to make sure I get it right.
Capricorn2481 8 hours ago |
I am really not in data science, and I have decent Clojure experience. Is there a reason anyone would pick Clojure over something like K? From what I understand, those array languages are really good for writing safe but efficient code on rectangular data.
erichocean 2 hours ago |
How about this?
(filter ds (> year 2008))
That's a trivial Clojure macro to make work if it's what you find "intuitive."
hatmatrix 10 hours ago |
Julia's Tidier.jl ecosystem is getting there too. It uses macros to mimic this 'special' evaluation framework of R, so the code is also readable in a similar way.
ertucetin 14 hours ago |
I’ve built many different kinds of software (backend, frontend, 3D games, cli tools, code editor, and more) with Clojure and have been using it for over a decade now.
I can confidently say that, among the list I mentioned, it’s the best for data manipulation/transformation. Thanks to the author for presenting it clearly and showing how the libraries and code look across different languages, all of which do a great job.
But Clojure has its own special place (maybe in my heart as well :). I think Clojure should be used more in the data science space. Thanks to the JVM, it can be very performant (I’m looking at you, Python).
hatmatrix 10 hours ago |
There was XLISP-STAT before R, but the scientists have spoken. They don't like the parentheses.
__mharrison__ 14 hours ago |
Good pandas and polars code should also be written in an immutable way...
epgui 14 hours ago |
Good python code can exist, but python makes it so easy to write bad code that good python rarely exists.
nxpnsv 14 hours ago |
Agree. While it is common to see code like these pandas examples, it is very possible to write these manipulations so that they return a new frame or view without changing the inputs.
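A minimal sketch of what that style looks like, using plain Python dicts rather than a real dataframe library. The `with_column` helper and the sample data are hypothetical; in pandas the analogous habit is to prefer methods that return a new frame, such as `DataFrame.assign`, over in-place assignment.

```python
def with_column(rows, name, fn):
    """Return a new tuple of rows with one extra column; the inputs are untouched."""
    return tuple({**row, name: fn(row)} for row in rows)


# Hypothetical input data.
original = (
    {"year": 2007, "mass_g": 3750},
    {"year": 2009, "mass_g": 5000},
)

derived = with_column(original, "mass_kg", lambda r: r["mass_g"] / 1000)

print(original[0])  # {'year': 2007, 'mass_g': 3750}
print(derived[0])   # {'year': 2007, 'mass_g': 3750, 'mass_kg': 3.75}
```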
olivia-banks 13 hours ago |
Having "NA" being treated as nil/null/None by default seems like it would cause the Namibia problem!
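To make the failure mode concrete, here is a small stdlib-only sketch. The CSV contents, the `read_rows` helper, and its `na_columns` parameter are all hypothetical; the point is only that a blanket "NA means missing" rule erases Namibia's country code, while scoping the rule to specific columns does not.

```python
import csv
import io

# Hypothetical CSV: "NA" in country_code is Namibia, "NA" in gdp is a missing value.
raw = "country,country_code,gdp\nNamibia,NA,12.4\nNorway,NO,NA\n"


def read_rows(text, na_columns=None, na_values=("NA", "")):
    """Parse CSV text, treating na_values as missing only in na_columns.

    na_columns=None mimics a blanket default that applies to every column.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {}
        for col, val in row.items():
            applies = na_columns is None or col in na_columns
            parsed[col] = None if applies and val in na_values else val
        rows.append(parsed)
    return rows


blanket = read_rows(raw)
scoped = read_rows(raw, na_columns={"gdp"})

print(blanket[0]["country_code"])  # None (Namibia's code is lost)
print(scoped[0]["country_code"])   # NA
print(scoped[1]["gdp"])            # None (still treated as missing)
```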
QubridAI 12 hours ago |
Interesting perspective. Clojure’s immutable, functional approach makes data wrangling feel very different from the more imperative style of R and Python.
thrawa8387336 12 hours ago |
I always wished Incanter took off.
zmmmmm 10 hours ago |
Seems like it's going to be a tough sell to get people to want to write
(tc/select-rows ds #(> (% "year") 2008))
instead of
filter(ds, year > 2008)
They seem to ignore the existence of Spark, so even if you specifically want to use the JVM it feels clearer and simpler:
ds.filter(r => r.year > 2008)
condwanaland 10 hours ago |
Couldn't agree more. R and dplyr's ability to pass column names as unquoted objects actually reduces cognitive load for new people so much (pure anecdata, nothing to back this up except lots of teaching people).
And that's on top of the vastly simpler syntax compared to what's being shown here
geokon 6 hours ago |
In my experience the advantage comes when you have a few more lines of code
The Clojure pipelining makes code much more readable. Granted dplyr has them too, but tidyverse pipes always felt like a hack on top of R (though my experience is dated here). While in Clojure I always feel like I'm playing with the fundamental language data-types/protocols. I can extend things in any way I want
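For anyone who hasn't seen the threading style, here's a rough Python approximation of the idea, assuming a hypothetical `pipe` helper and made-up rows. It is only a sketch of left-to-right pipelining with ordinary functions, not a model of how Clojure's threading macros or the tidyverse pipe are implemented.

```python
from functools import reduce


def pipe(value, *steps):
    """Thread value through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical sample rows.
rows = [
    {"species": "Adelie", "year": 2007, "mass_g": 3750},
    {"species": "Gentoo", "year": 2009, "mass_g": 5000},
    {"species": "Gentoo", "year": 2009, "mass_g": 5400},
]

result = pipe(
    rows,
    lambda rs: [r for r in rs if r["year"] > 2008],  # keep recent rows
    lambda rs: [r["mass_g"] for r in rs],            # pull out one column
    lambda ms: sum(ms) / len(ms),                    # average it
)
print(result)  # 5200.0
```

Each step is just a function over ordinary data, which is roughly the property I mean by working with the fundamental data types.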
aphyr 6 hours ago |
You're right, that is longer! I get why though; `filter` is a clojure.core function name people don't necessarily feel comfortable shadowing, and the Clojure and Spark versions make it clear what's a symbol in local scope versus a field in the dataset. I don't think it'd be hard to make a little wrapper for this sort of thing though! Here's an example which turns any symbols not in local scope into field lookups on an implicit row variable.
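(require '[clojure.walk :refer [postwalk]])

;; Note: this shadows clojure.core/filter in the current namespace.
(defmacro filter [ds & anaphoric-pred]
  (let [row-name (gensym 'row)
        ;; Rewrite every symbol that doesn't resolve into a lookup on the row.
        pred (postwalk (fn [form]
                         (if (and (symbol? form)
                                  (nil? (resolve form)))
                           `(get ~row-name ~(str form))
                           form))
                       anaphoric-pred)]
    ;; ~ds splices in the caller's dataset expression.
    `(tc/select-rows ~ds (fn [~row-name] ~@pred))))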
Now you can write
(filter ds (> year 2008))
And it'll expand to the tc form:
(pprint (macroexpand '(filter ds (> year 2008))))
=> (tc/select-rows ds (fn [row2411] (> (get row2411 "year") 2008)))
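Python has no macros, but you can sketch a loosely similar trick with `eval` and a custom name lookup: names found in an explicit scope win, and anything else falls through to the current row. The `select_rows` helper and its `scope` parameter are hypothetical, and evaluating expression strings like this is only reasonable for trusted input.

```python
from collections import ChainMap


def select_rows(rows, expr, scope=None):
    """Keep rows where expr is truthy.

    Names in expr are looked up in scope first, then in the row, so bare
    names that are not local variables act as column lookups.
    """
    code = compile(expr, "<predicate>", "eval")
    scope = scope or {}
    # Emptying __builtins__ keeps the expression small in reach; it is a
    # convenience for trusted input, not a security sandbox.
    return [
        row for row in rows
        if eval(code, {"__builtins__": {}}, ChainMap(scope, row))
    ]


ds = [{"year": 2007}, {"year": 2009}, {"year": 2010}]
cutoff = 2008
print(select_rows(ds, "year > cutoff", {"cutoff": cutoff}))
# [{'year': 2009}, {'year': 2010}]
```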
teleforce 10 hours ago |
All the comparisons are with scripting and untyped languages, perhaps for faster development and a more intuitive ecosystem to increase developer productivity.
In the age of IntelliSense, auto-completion and AI-assisted coding, is the choice of scripting and untyped languages justifiable for increased productivity at the expense of safety and reliability?
If you're building a data system and not just exploring, surely modern compiled and typed systems languages like Rust and D make more sense for safety and reliability for the end users?
Even more so with D, where you can even have scripting capability for the exploratory and prototyping stages with its built-in REPL facility [1],[2]. This is feasible due to its very fast compile times, unlike Rust. It has a more intuitive "Pythonic" syntax compared to other typed languages [3]. You can also program with the GC on by default if you choose to. Apparently, you can have your cake and eat it too.
[1] drepl:
https://github.com/dlang-community/drepl
[2] Why I use the D programming language for scripting:
https://opensource.com/article/21/1/d-scripting
[3] All in on DLang: Why I pivoted to D for web, teaching, and graphics in 2025 and beyond! [PDF]
https://dconf.org/2025/slides/shah.pdf
geokon 6 hours ago |
It's a bit apples to oranges.
If you're "building data system not just for exploratory" then you're probably not going to be using any of the presented options. However, in my experience Clojure has an ecosystem where it is very easy to transition from exploring/playing with data at the REPL to a more robust "pro" setup that's designed to scale, handle failures, etc.
teleforce 4 hours ago |
I understand the sentiment, but I disagree with the approach: it's probably efficient for exploratory work but not effective for everything else, including prototyping and systems development.
For any engineering work, including software engineering, you choose the best tool for the job. In D you can have a high-performance tool capable of bit shifting, string processing and array manipulation (to name a few), scaling from scripts to highly concurrent low-latency applications (see the presentation in ref [3] above by Prof. Shah from Yale).
It's a shame that properly typed programming languages are being ignored just because of programmers' locally sub-optimal preferences and limited exposure. The productivity gain from typical scripting languages, including Python, is diminishing every day with the proliferation of IntelliSense, auto-complete and AI-assisted coding.
For production code, systems based on scripting languages, if they ever make it to production (and most do, e.g. AirBNB, Twitter, Shopify, GitHub, etc.), will be a maintenance headache and a user nightmare if the support is not great and the company is not a unicorn start-up. The last thing you want is for the saved e-claim form that you spent many hours preparing to totally disappear because the system cannot recall the saved version. Granted, this can happen for many reasons, but most of the problematic production systems are written in scripting languages, including Python, because these are the only languages the programmers know and are familiar with. Adding insult to injury, the readily available so-called "batteries included" libraries are convenient but, ironically, written in other compiled but unsafe systems languages such as C/C++.
geokon 4 hours ago |
I think you're going to have trouble convincing people a compile-loop language is going to be on par with a REPL/interactive setup. You can look at some extreme examples like MATLAB. With all your tools you're never going to reach the same level of interactive productivity with D for the subset of problems it addresses.
You can have all your tools dump out and rewrite the oodles of boilerplate your typed languages require - but at the end of the day you have to read all that junk... or not? and just vibecode and #yolo it? But then you're back to "safety and reliability" problems and you haven't won anything
Also "safety and reliability" are just non-goals in a lot of contexts. My shitty plotting script doesn't care about "safety". It's not sitting on the network. It's reliable enough for the subset of inputs I provide it. I don't need to handle every conceivable corner case. I have other things to do
> Adding to the insults are the available readily available libraries are convenients but ironically written in other compiled but unsafe system language in C/C++
No one cares if you leak memory in some corner case with some esoteric inputs. And no one is worried your BLAS bindings are going to leak your secrets. These are just not objectives
teleforce 3 hours ago |
My point is that Dlang scales from beginner to expert, from scripting to highly concurrent low-latency applications. Why settle for sub-optimal scripting languages if you can have the real deal with much better performance and freely available open source?
In the automotive world, if you can afford it, you need a daily-driver car for the job and supermarket runs, a weekend supercar for fun/showing off, and an off-road 4x4 for overnight camping. But in the software world D can cater for mostly everything, with free open-source compilers, minimal productivity overhead and much cheaper hosting as well [1].
Funny you mentioned BLAS, since Dlang BLAS implementation has also surpassed the run-of-the-mill high performance BLAS library that these scripting languages can only dream of (Matlab calling the 3rd party Fortran codes no less) [2].
[1] Saving Money by Switching from PHP to D:
https://dlang.org/blog/2019/09/30/saving-money-by-switching-...
[2] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
zelphirkalt 30 minutes ago |
One general problem or challenge with statically, strongly typed languages is that one can quickly get to a local optimum, but that local optimum might lack some flexibility that is needed later on and is only discovered after some usage and seeing many use cases. Then a big refactoring is ahead, possibly even of the core types of the project. If that is allowed and introducing such flexibility is considered, it often happens that expressing it in types becomes quite complex, which, without a lot of care, will impact the user of the project. The user needs to adhere to the same types, and there might then be quite some ceremony around making something of the correct type to use it with the project.
It is safer, but it is not without its downsides. It demands a careful design to make something people will enjoy using.
manudaro 6 hours ago |
The Clojure tablecloth performance numbers here are pretty surprising; I usually see Python/polars dominating these benchmarks. I've been running similar transformations on transit data feeds, and polars consistently outperforms pandas by 3x-5x on the group-by operations, but I hadn't considered Clojure for the pipeline. Anyone actually using tablecloth in production data workflows?