How We Built DoltHub: Stack and Architecture

March 11, 2020

In our introductory article for this series, we took a high-level look at the technology stack and architecture behind DoltHub, the online home for Dolt data repositories. In this article, we'll delve a little deeper and discuss how the pieces of the system are organized and how they communicate with each other.

DoltHub API

This is DoltHub's brain. It's a collection of gRPC services (more on what that means in a minute) written in Go. The architecture is fairly complex, since it's designed to be modular and scalable, so we'll stick to the highlights.

Domain layer

The domain layer is where database and core business logic reside, and where we define our domain objects. Domain objects are the "things" in an application: for DoltHub, think Repository and User defined as Go structs. In this layer, we define models that represent these objects along with mappings and procedures for storing and retrieving them from the database. Generally, each model in the domain layer has its own database table. For our database, we use PostgreSQL in production and SQLite for things like local testing.
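
To make that concrete, here is a minimal sketch of what a pair of domain objects and a persistence interface could look like. The field names, the UserRepo interface, and the map-backed implementation are all illustrative assumptions, not DoltHub's actual schema or code; a production implementation would issue SQL against PostgreSQL or SQLite instead.

```go
package main

import (
	"context"
	"fmt"
)

// User and Repository are hypothetical domain objects. The fields are
// illustrative only and do not reflect DoltHub's real tables.
type User struct {
	ID   int64
	Name string
}

type Repository struct {
	ID      int64
	OwnerID int64 // foreign key to the owning user
	Name    string
}

// UserRepo is an assumed persistence interface for the users table.
type UserRepo interface {
	GetByName(ctx context.Context, name string) (*User, error)
}

// memUserRepo is an in-memory stand-in for a SQL-backed implementation.
type memUserRepo struct {
	byName map[string]*User
}

// GetByName returns (nil, nil) when no user matches, so "not found" is
// reported as a nil object rather than as an error.
func (r *memUserRepo) GetByName(ctx context.Context, name string) (*User, error) {
	return r.byName[name], nil
}

// findUser is a small helper that supplies a background context.
func findUser(repo UserRepo, name string) (*User, error) {
	return repo.GetByName(context.Background(), name)
}

func main() {
	var repo UserRepo = &memUserRepo{byName: map[string]*User{
		"alice": {ID: 1, Name: "alice"},
	}}
	u, err := findUser(repo, "alice")
	fmt.Println(u.Name, err)
}
```

Hiding storage behind an interface like this is one common way to swap databases between production and local testing.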

In addition to low-level database CRUD, our domain layer implements a number of “use cases”, which describe high-level interactions with the system that may touch several tables. For example, we define a use case for Repository called GetByOwnerNameTuple which takes the names of a repository and its owner as strings and returns a Repository object if a corresponding record is found in the database.

func (impl *repositoryUseCaseImpl) GetByOwnerNameTuple(ctx context.Context, owner, name string) (*Repository, error) {
	// First, check whether the owner name belongs to a user.
	user, err := impl.userRepo.GetByName(ctx, owner)
	if err != nil {
		return nil, err
	}
	if user != nil {
		return impl.repoRepo.GetUserRepoByName(ctx, user.ID, name)
	}

	// No such user, so check whether it belongs to an organization.
	org, err := impl.orgRepo.GetByName(ctx, owner)
	if err != nil {
		return nil, err
	}
	if org == nil {
		// Neither a user nor an organization has this name.
		return nil, nil
	}
	return impl.repoRepo.GetOrgRepoByName(ctx, org.ID, name)
}

The code for the GetByOwnerNameTuple use case in our domain layer for Repositories

To do its job, GetByOwnerNameTuple first reaches into the users table to see if the owner name given matches a user's name. If not, it then looks in the organizations table to see if the name matches an organization instead. Finally, if either a user or organization is found, the use case uses its ID as a foreign key into the repositories table to find the repository itself.
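
The lookup order described above can be exercised with in-memory stand-ins for the three tables. This is only a sketch: the store type, its maps, and the demo data are hypothetical, and the function is a simplified re-implementation rather than the use case itself.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical, pared-down domain objects.
type (
	User         struct{ ID int64 }
	Organization struct{ ID int64 }
	Repository   struct {
		OwnerID int64
		Name    string
	}
)

// store is an in-memory stand-in for the users, organizations, and
// repositories tables.
type store struct {
	users     map[string]*User
	orgs      map[string]*Organization
	userRepos map[int64]map[string]*Repository
	orgRepos  map[int64]map[string]*Repository
}

// getByOwnerNameTuple checks users first, then organizations, and uses
// the owner's ID as the key into the matching repository map.
func (s *store) getByOwnerNameTuple(ctx context.Context, owner, name string) (*Repository, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	if u := s.users[owner]; u != nil {
		return s.userRepos[u.ID][name], nil
	}
	if o := s.orgs[owner]; o != nil {
		return s.orgRepos[o.ID][name], nil
	}
	return nil, nil // neither a user nor an organization has this name
}

// lookup supplies a background context and drops the error for brevity.
func lookup(s *store, owner, name string) *Repository {
	repo, err := s.getByOwnerNameTuple(context.Background(), owner, name)
	if err != nil {
		return nil
	}
	return repo
}

// newDemoStore builds made-up data: one user and one organization,
// each owning a repository called "data".
func newDemoStore() *store {
	return &store{
		users:     map[string]*User{"alice": {ID: 1}},
		orgs:      map[string]*Organization{"acme": {ID: 7}},
		userRepos: map[int64]map[string]*Repository{1: {"data": {OwnerID: 1, Name: "data"}}},
		orgRepos:  map[int64]map[string]*Repository{7: {"data": {OwnerID: 7, Name: "data"}}},
	}
}

func main() {
	s := newDemoStore()
	for _, owner := range []string{"alice", "acme", "nobody"} {
		fmt.Printf("%s/data found: %t\n", owner, lookup(s, owner, "data") != nil)
	}
}
```

Note that an unknown owner and an unknown repository name both come back as a nil Repository with no error, which leaves it to the caller to decide how to report "not found".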

In general, the domain layer is allowed to assume that its inputs are valid and have been appropriately gated on permissions. And that brings us to the service layer...

Service layer

The service layer handles communication between the DoltHub API and consumers like our GraphQL server (more on that below), the Dolt command line client, and anyone else with a gRPC client and the URL of our endpoint. Its job includes validating and authenticating requests from clients, then translating the data into a format the domain layer understands. The domain layer is allowed to assume an orderly universe because the service layer does the hard work of wrangling the chaos of the outside world.
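
As a rough illustration of that division of labor, the sketch below shows a handler that authenticates and validates a request before handing clean arguments to the domain layer. Everything here is an assumption made for the example: GetRepositoryRequest stands in for a generated protobuf message, the token comparison is a placeholder for real authentication, and the sentinel errors stand in for the gRPC status codes (such as Unauthenticated, InvalidArgument, and NotFound) that a real service would likely return.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// GetRepositoryRequest is a hypothetical stand-in for a generated
// protobuf request message.
type GetRepositoryRequest struct {
	OwnerName string
	RepoName  string
	AuthToken string
}

type Repository struct {
	Owner, Name string
}

// RepositoryUseCase is the slice of the domain layer this handler needs.
type RepositoryUseCase interface {
	GetByOwnerNameTuple(ctx context.Context, owner, name string) (*Repository, error)
}

// Sentinel errors standing in for gRPC status codes.
var (
	ErrUnauthenticated = errors.New("unauthenticated")
	ErrInvalidArgument = errors.New("invalid argument")
	ErrNotFound        = errors.New("not found")
)

type repositoryService struct {
	repos      RepositoryUseCase
	validToken string // placeholder for a real authentication check
}

// GetRepository gates the request, then calls into the domain layer
// with inputs the domain layer is allowed to trust.
func (s *repositoryService) GetRepository(ctx context.Context, req *GetRepositoryRequest) (*Repository, error) {
	if req.AuthToken != s.validToken {
		return nil, ErrUnauthenticated
	}
	owner, name := strings.TrimSpace(req.OwnerName), strings.TrimSpace(req.RepoName)
	if owner == "" || name == "" {
		return nil, ErrInvalidArgument
	}
	repo, err := s.repos.GetByOwnerNameTuple(ctx, owner, name)
	if err != nil {
		return nil, err
	}
	if repo == nil {
		return nil, ErrNotFound
	}
	return repo, nil
}

// fakeUseCase is a canned domain layer that knows one repository.
type fakeUseCase struct{}

func (fakeUseCase) GetByOwnerNameTuple(ctx context.Context, owner, name string) (*Repository, error) {
	if owner == "alice" && name == "data" {
		return &Repository{Owner: owner, Name: name}, nil
	}
	return nil, nil
}

// call wires up the service with the fake domain layer for the demo.
func call(token, owner, name string) (*Repository, error) {
	svc := &repositoryService{repos: fakeUseCase{}, validToken: "secret"}
	req := &GetRepositoryRequest{OwnerName: owner, RepoName: name, AuthToken: token}
	return svc.GetRepository(context.Background(), req)
}

func main() {
	_, err := call("wrong", "alice", "data")
	fmt.Println("bad token:", err)
	repo, err := call("secret", " alice ", "data")
	fmt.Println("good request:", repo.Owner, repo.Name, err)
}
```

The domain layer's "nil means not found" result is translated into an explicit error at this boundary, which is the kind of wrangling the service layer exists to do.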

Earlier we mentioned that the DoltHub API is a collection of gRPC services. gRPC, which according to the FAQ stands for "gRPC Remote Procedure Calls", is a remote procedure call framework originally developed by Google for communication between clients and servers. You can think of it loosely as a more advanced alternative to something like a REST API, but that's a nuanced and controversial topic and each has its advantages and disadvantages.

In any event, the service layer is where our gRPC calls are actually implemented. The calls available and the messages they send and receive are defined using protocol buffers (“protobufs”), "Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data"—like JSON, but engineered to be as small and fast as possible.

One of the key features of protobufs is the wide variety of clients and code generators available for working with them in different languages. (As an aside, TypeScript generation is provided by a third-party package, and the heavily object-oriented style of the generated clients isn't our favorite. However, since introducing the GraphQL layer, our use of them has been reduced to a tolerable level.)

Deployment

The whole system is containerized with Docker and deployed to AWS using Kubernetes. We've got a serious amount of YAML and other configuration supporting it. We could write an entire article on our deployment technology, and may someday. For now, if you want to know more about how Kubernetes works, the Kubernetes website has a concepts guide to help get you started.

DoltHub

As far as the actual DoltHub web application is concerned, three separate apps built with TypeScript and React make up what you see at dolthub.com:

The main DoltHub application, built using Next.js
The documentation website, a Gatsby app hacked up a bit to accommodate the documentation use case
This blog, a more typical Gatsby app

In addition, there is a GraphQL server, which we built using NestJS and TypeGraphQL. By design, it is tightly coupled to DoltHub; it exists to communicate with the API so that DoltHub doesn't have to. Before we added the GraphQL layer, really gross code for communicating with the API was all over the front end; now it all lives in the GraphQL layer and we use Apollo Client to query it very ergonomically from DoltHub. We'll go into more depth on our use of GraphQL in a future article in this series, but that's the gist.

You might ask: why three separate front-end apps? Good question. Originally, this blog was a part of our corporate website, which was a Gatsby app. We like Gatsby, so when we decided to migrate the blog to live at dolthub.com/blog, we wanted to keep using it if possible. It turns out that Gatsby supports a simple pathPrefix configuration rule for exactly this.

Not much later, we decided we wanted a documentation website with docs written in Markdown and living at dolthub.com/docs. Again, Gatsby is a natural choice for such an application, and since it was reasonably painless for the blog we decided to use it for docs as well.

I say “reasonably”—there have been a few deployment and developer experience issues owing to this setup. For example, we have a library of React components shared between the three apps, and relative links are a huge pain, since the Gatsby pathPrefix magic doesn't work properly on the shared components. But overall we're happy.

All four of the deployments mentioned above have their own Dockerfiles and are deployed to AWS with Kubernetes. A current pain point in our deployment process is long build times due to lack of caching for Yarn. It seems like there may be solutions, but we haven't gotten around to trying any.

To summarize:

The DoltHub API consists of gRPC services defined using protocol buffers and implemented in Go. It has a domain layer that handles business logic and persistence, and a service layer that handles communication with the outside world and all the gatekeeping that entails.

The DoltHub web application is a Next.js app that communicates with a NestJS GraphQL server, which in turn talks to the DoltHub API using generated gRPC clients. The documentation and blog are separate Gatsby apps despite being hosted as "subdirectories" of dolthub.com.

We hope that this has been an interesting look into a relatively complex system. There's much more to say about many of the topics we discussed, and many things we didn't touch upon at all. If there's something we didn't cover that you'd like to know, please let us know here. We're constantly looking for topics to blog about.

And if you haven't yet, head over to DoltHub, get yourself a copy of Dolt, and let us know what you think. Your feedback is invaluable in helping us build the future of data.
