Here's a neat trick I picked up from the source code:
    indices := fdr.rgx.FindAllStringSubmatchIndex(text, -1)
    for _, pair := range indices {
        start := pair[0]
        end := pair[1]
        leftStart := max(0, start-CONTEXT_LENGTH)
        rightEnd := min(end+CONTEXT_LENGTH, len(text))
        // TODO: this doesn't work with Unicode
        if start > 0 && isLetter(text[start-1]) {
            continue
        }
        if end < len(text) && isLetter(text[end]) {
            continue
        }
An earlier comment in the source explains why the code skips `\b`:

    // The '\b' word boundary regex pattern is very slow. So we don't use it here and
    // instead filter for word boundaries inside `findConcordance`.
    // TODO: case-insensitive matching - (?i) flag (but it's slow)
    pattern := regexp.QuoteMeta(keyword)
So instead of `\bWORD\b` it does the simplest possible match, then checks whether the character immediately before the match or immediately after it is also a letter. If either is, it skips the match.

Why not use a precomputed posting list?
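For what that might look like: a minimal sketch of a word → (document, byte offset) posting list built once at start-up, so a query becomes a map lookup instead of a per-request regexp scan over the corpus. All of the names here are mine, not from fast-concordance:

    package main

    import (
        "fmt"
        "strings"
        "unicode"
    )

    // Posting records one occurrence of a word: which document it is in
    // and the byte offset where the word starts.
    type Posting struct {
        DocID int
        Start int
    }

    // buildIndex tokenizes every document once and records each word's
    // positions, so a query becomes a map lookup rather than a regexp
    // scan over the whole corpus.
    func buildIndex(docs []string) map[string][]Posting {
        index := make(map[string][]Posting)
        for docID, doc := range docs {
            start := -1
            for i, r := range doc {
                if unicode.IsLetter(r) {
                    if start < 0 {
                        start = i
                    }
                    continue
                }
                if start >= 0 {
                    word := strings.ToLower(doc[start:i])
                    index[word] = append(index[word], Posting{DocID: docID, Start: start})
                    start = -1
                }
            }
            if start >= 0 {
                word := strings.ToLower(doc[start:])
                index[word] = append(index[word], Posting{DocID: docID, Start: start})
            }
        }
        return index
    }

    func main() {
        docs := []string{"The cat sat.", "Concatenate is not a cat."}
        index := buildIndex(docs)
        for _, p := range index["cat"] {
            fmt.Printf("doc %d, byte offset %d\n", p.DocID, p.Start)
        }
    }

A side effect of tokenizing at index time is that the word-boundary question disappears: the "cat" inside "Concatenate" never gets a posting in the first place.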
> The server reads all the documents into memory at start-up. The corpus occupies about 600 MB, so this is reasonable, though it pushes the limits of what a cloud server with 1 GB of RAM can handle. With 2 GB, it's no problem.
1200 books per 1GB server? Whole-internet search engines are older than 1GB servers.
> queries that take 2,000 milliseconds from disk can be done in 800 milliseconds from memory. That's still too slow, though, which is why fast-concordance uses [lots of threads]
No query should ever take either of those amounts of time. And the "optimisation" is just to use more threads, which other consumers could have used to run their own searches but now can't.
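For context, the fan-out the article seems to describe splits a single query across documents with a pool of workers. A rough sketch, with nothing taken from fast-concordance and strings.Contains standing in for its regexp scan:

    package main

    import (
        "fmt"
        "runtime"
        "strings"
        "sync"
    )

    // scanAll fans a single query out over the corpus with a bounded pool of
    // goroutines: lower latency for this query, paid for with CPU time that
    // any concurrently running queries would otherwise have used.
    func scanAll(docs []string, keyword string) []int {
        jobs := make(chan int)
        var (
            mu   sync.Mutex
            hits []int
            wg   sync.WaitGroup
        )
        for w := 0; w < runtime.GOMAXPROCS(0); w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for docID := range jobs {
                    if strings.Contains(docs[docID], keyword) {
                        mu.Lock()
                        hits = append(hits, docID)
                        mu.Unlock()
                    }
                }
            }()
        }
        for docID := range docs {
            jobs <- docID
        }
        close(jobs)
        wg.Wait()
        return hits
    }

    func main() {
        docs := []string{"the cat sat", "no cats here at all", "dog only"}
        fmt.Println(scanAll(docs, "cat")) // order depends on scheduling
    }

It lowers latency for a single query only by borrowing cores that concurrent queries would otherwise be using.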
https://www.pingdom.com/blog/original-google-setup-at-stanfo...
500 Internal Server Error