The next problem is how to build this index in a reasonable amount of time (remember, this took months in our first iteration). As is often the case, the trick is to find some insight into the specific data we're working with to guide our approach. In our case it's two things: Git's use of content-addressable hashing and the fact that there's actually quite a lot of duplicate content on GitHub. Those two insights lead us to the following decisions:
- Shard by Git blob object ID, which gives us a nice way of evenly distributing documents between shards while avoiding any duplication. There won't be any hot servers due to special repositories, and we can easily scale the number of shards as necessary. (See the shard-assignment sketch after this list.)
- Model the index as a tree and use delta encoding to reduce the amount of crawling and to optimize the metadata in our index. For us, metadata are things like the list of locations where a document appears (which path, branch, and repository) and information about those objects (repository name, owner, visibility, etc.). This data can be quite large for popular content. (A delta-crawl sketch follows below as well.)
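The first decision works because blob OIDs are themselves cryptographic hashes of content, so they are already uniformly distributed. Here is a minimal sketch of how shard assignment could work under that assumption; the shard count and function name are hypothetical, not from the post:

```python
import hashlib

NUM_SHARDS = 32  # hypothetical shard count


def shard_for_blob(blob_oid: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a Git blob object ID (hex SHA) to a shard.

    Because OIDs are content-addressed hashes, they are uniformly
    distributed, so reducing a prefix of the OID modulo the shard
    count spreads documents evenly and deterministically, with no
    hot shards for popular repositories.
    """
    # Interpret the first 8 hex characters as an integer, then
    # take it modulo the shard count.
    return int(blob_oid[:8], 16) % num_shards


# Git hashes a blob as sha1("blob <size>\0<content>"), so identical
# content always yields the same OID, lands on the same shard, and
# is indexed only once.
readme_oid = hashlib.sha1(b"blob 12\x00Hello world\n").hexdigest()
print(readme_oid, "->", "shard", shard_for_blob(readme_oid))
```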
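For the second decision, the duplicate-content insight means most blobs in a fork or a new commit are already indexed. A minimal sketch of the delta idea, assuming we can compare a repository's blob set against an already-indexed base (the set names and four-character OIDs here are illustrative placeholders):

```python
def delta_crawl(repo_blobs: set[str], base_blobs: set[str]) -> set[str]:
    """Return only the blob OIDs that need fresh indexing.

    If a repository (a fork, or a newer commit of the same repo)
    shares most of its content with an already-indexed base, only
    the difference needs to be crawled; metadata for shared blobs
    can be recorded as a delta against the base instead of being
    duplicated in full.
    """
    return repo_blobs - base_blobs


base = {"aa11", "bb22", "cc33"}  # blobs already in the index
fork = {"aa11", "bb22", "dd44"}  # a fork that changed one file
print(delta_crawl(fork, base))   # -> {'dd44'}
```

The payoff is that crawl cost scales with how much content is actually new, rather than with the total size of every repository.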
From "The technology behind GitHub's new code search" by Timothy Clem