[^1]: https://pvp.haskell.org/
Why are they requiring two numbers to represent one (semantic) number?
A problem with Semver is that a jump from 101.1.2 to 102.0.0 might be a trivial upgrade, and then the jump to 103.0.0 requires rewriting half your code. With two major version numbers, that would be 1.101.1.2 to 1.102.0.0 to 2.0.0.0. That makes the difference immediately clear, and lets library authors push a 1.103.0.0 release if they really need to.
In practice, with Semver, changes like this get reflected in the package name instead of the version number. (Like maybe you go from data-frames 101.1.2 to data-frames-2 1.0.0.) But there's no consistent convention for how this works, and it always felt awkward to me, especially if the intention is that everyone migrates to the new version of the API eventually.
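For concreteness, this is roughly what PVP-style bounds look like in a .cabal file (package name and version numbers are made up for illustration). Under PVP the first *two* components together form the "major" version, so a breaking change bumps either of them and the upper bound excludes the next one:

```
build-depends:
  data-frames >= 1.101.1 && < 1.102
```

A jump to 1.102 or 1.103 signals an ordinary breaking change; a jump to 2.0 signals "rewrite half your code" territory.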
The author of a library has no idea how tightly coupled my code is to theirs, and should therefore only give yes/no answers to "is this a breaking change?"
For example, when a large ORM library I use changed a small thing like "no longer expose db tables for certain queries, because not all db engines support it anyway" (i.e. moving a protected property to private), it required a two-week effort to restructure the code base.
> In practice, with Semver, changes like this get reflected in the package name instead of the version number.
Not once have I seen this happen. Any specific examples?
Thus making the silly example possible.
Remember, everyone: Haskell is very old!
As it's grown, it's been pretty cool to have transparent schema transformations: instead of every function opaquely mapping a dataframe to a dataframe, you can have function signatures like:
```
extract :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int, Column "comments" T.Text]
        -> TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int]
-- body of extract

transform :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int]
          -> TypedDataFrame [Column "price" Double, Column "quantity" Int]
-- body of transform

clean :: TypedDataFrame [Column "price" (Maybe Double), Column "quantity" Int, Column "comments" T.Text]
      -> TypedDataFrame [Column "price" Double, Column "quantity" Int]
clean = transform . extract
```
But you can also do the simple thing too and only worry about type safety if you prefer:
```
df
  |> D.filterWhere (country_code .==. "JPN")
  |> D.select [F.name name]
  |> D.take 5
```
Being able to work across that whole spectrum of type safety is pretty great.
DataHaskell in general is revived and improving on multiple fronts. Exciting stuff!
https://blog.carolina.codes/p/call-for-speakers-2026-is-open
This makes complex dashboards so much easier to build, because in Python you have to test everything in the dashboard to make sure a change to a common dataset didn’t break anything.
Is there a good web dashboard library like streamlit for Haskell I wonder?
You can try it from https://www.datahaskell.org/ under "try out our current stack"
Strong typing and data science seem like a good combination.
When I've audited some of the published data at our org, there are errors that would have been caught with even basic type-safety. That's how I got the green light to start harassing my team with type safety in our pipelines.
Of course, as with all things in programming, it isn't a silver bullet. It adds a layer of rigor that can slow things down, and there are often (seemingly always) nuances which can't be caught easily by most type systems. Things like complex relations between values (like "if Y is in [range], X must be null, and Z must be one of [a, b, c]"). Even so, eliminating categories of errors is worthwhile, and makes it easier to focus on the more complex challenges.
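Rules like that can at least get one checked home via a smart constructor, even if the type system can't enforce them at compile time. A minimal sketch (the `Record` type, field names, and the specific rule are all made up for illustration):

```haskell
-- Hypothetical record with a cross-field invariant that a plain
-- type system won't express: if y is within [0, 10], then x must
-- be absent and z must be one of "a", "b", "c".
data Record = Record
  { x :: Maybe Double
  , y :: Double
  , z :: String
  } deriving (Show)

-- Runtime validation at the boundary: every Record the rest of the
-- pipeline sees has already passed this check.
validate :: Record -> Either String Record
validate r
  | inRange, x r /= Nothing        = Left "x must be null when y is in range"
  | inRange, z r `notElem` allowed = Left "z must be one of a, b, c"
  | otherwise                      = Right r
  where
    inRange = y r >= 0 && y r <= 10
    allowed = ["a", "b", "c"]
```

It's not a compile-time guarantee, but it moves the nuance out of scattered ad-hoc checks and into one place the types can then carry around.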
Overall I'd agree though: it's a good combination.
Now hoping to build a bunch of Neuro symbolic AI on top of this.