Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API

44 points by adammajcher 3 hours ago | 12 comments

mechazawa 2 hours ago |
Is only Bun supported, or also regular Node?
hersko 2 hours ago |
I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and is presumably far cheaper, as I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
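A rough sketch of that text-first, image-fallback flow, in Python for brevity. Everything here is an assumption for illustration: the parser, the PNG renderer, and the two model calls are passed in as hypothetical callables (no real pdf-parse or model API is used), and the `MIN_CHARS` threshold for deciding that a text layer is usable is invented.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical threshold: below this many characters, treat the PDF
# as having no usable text layer (e.g. a scan) and go straight to the image.
MIN_CHARS = 50


@dataclass
class Extraction:
    data: dict
    source: str  # "text" or "image"


def extract_with_fallback(
    pdf_bytes: bytes,
    read_text_layer: Callable[[bytes], str],
    extract_from_text: Callable[[str], Optional[dict]],
    render_png: Callable[[bytes], bytes],
    extract_from_image: Callable[[bytes], Optional[dict]],
) -> Optional[Extraction]:
    """Try the cheap text path first; fall back to the image path if it fails."""
    text = read_text_layer(pdf_bytes).strip()
    if len(text) >= MIN_CHARS:
        data = extract_from_text(text)
        if data is not None:
            return Extraction(data, "text")

    # Text path was unavailable or failed: render the page and send the image.
    data = extract_from_image(render_png(pdf_bytes))
    return Extraction(data, "image") if data is not None else None
```

The callables return `None` to signal a failed extraction, so the more expensive image call only happens when the text path produces nothing usable.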
mimim1mi 2 hours ago |
By definition, OCR means optical character recognition. Which extraction method works depends on the contents of the PDF. Many PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
trollbridge an hour ago |
I always render an image and OCR that, so I don’t get odd problems from invisible text. It also avoids being affected by anything added for SEO.
saaaaaam an hour ago |
There was an interesting discussion on here a couple of months back about images vs text, driven by this article: https://www.seangoedecke.com/text-tokens-as-image-tokens/
Discussion is here: https://news.ycombinator.com/item?id=45652952
sgc 2 hours ago |
How does this compare to dots.ocr? I got fantastic results when I tested dots.
https://github.com/rednote-hilab/dots.ocr
mjrpes an hour ago |
Ocrbase is CUDA-only, while dots.ocr uses vLLM, so it should support ROCm/AMD cards?
actionfromafar 32 minutes ago |
How about CPU?
v3ss0n an hour ago |
How is this better than Surya/Marker or Kreuzberg? https://github.com/kreuzberg-dev/kreuzberg
jadbox 43 minutes ago |
Sounds like someone needs to run their own test cases and report back on which solution does a better job...
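One way such a test could be scored, as a minimal sketch: compute the character error rate (CER) of each tool's output against a hand-checked reference transcription and rank by it. The tool names and strings in the usage example are made up; nothing here calls any of the tools mentioned in this thread.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def rank_tools(reference: str, outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (tool name, CER) pairs sorted from lowest to highest error."""
    scored = [(name, cer(reference, text)) for name, text in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical outputs from two unnamed tools on the same page.
    reference = "Total due: 1,250.00 EUR"
    outputs = {"tool_a": "Total due: 1,250.00 EUR", "tool_b": "Tota1 due: 1.250,00 EUR"}
    for name, score in rank_tools(reference, outputs):
        print(f"{name}: CER={score:.3f}")
```

CER only measures raw text fidelity; it says nothing about layout or structured-field accuracy, so it would be one metric among several.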
sync 25 minutes ago |
This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...
constantinum 19 minutes ago |
What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.
Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.
Other tools in this space:
LLMWhisperer/Unstract (AGPL)
Reducto
Extend AI
LlamaParse
Docling