Database Ranking on DoltHub

March 25, 2024

For those of you who are new, Dolt is a database that supports Git-style versioning. DoltHub is a place on the internet to share and collaborate on Dolt databases. The Discover page on DoltHub is where users can explore trending databases, discover projects to contribute to, and connect with the community.

Until recently, databases in the Discover section could be sorted by either most recently updated or star count. We realized that these methods didn't fully reflect what makes a database most valuable to our community. Questions like “How would the Discover section help people find interesting databases?” or “What factors make a database rank higher than others?” inspired us to develop a new database ranking system, which highlights databases that meet specific criteria beyond just recency or star count. Today's blog post dives into how we calculate database scores and how you can improve your database's visibility.

Our Ranking Algorithm Explained

Our algorithm is designed to spotlight databases that are not only popular but also demonstrate active engagement with DoltHub's features. It evaluates databases on several factors, including the number of stars, forks, and collaborators.

Key elements of our ranking criteria include:

Popularity. Measured by the number of stars and forks, indicating the community's interest and engagement.
Update recency. We reward databases that receive frequent updates, ensuring users have access to the most current data.
Usage of DoltHub features. We love seeing DoltHub features in action! Active use of DoltHub’s features, such as pull requests, file imports, and long-running SQL queries, not only benefits your workflow but also boosts your database's visibility on the platform.
Collaboration. Like GitHub, DoltHub is a central place to collaborate on Dolt databases. You can give other users read and write permissions to your Dolt databases and work together. Databases with high levels of collaborative activity score higher in our rankings.
Scale. The scale of your database matters. Larger and more comprehensive databases are viewed as valuable resources and rank higher.

Given the criteria mentioned, here are the highest-ranked databases on DoltHub as of now. They have a number of stars and forks, as well as recent updates.

You can also use the dropdown to sort the databases by most recent update or by highest number of stars.

Behind the Scenes of Our Ranking System

Calculating the ranking of our databases is an expensive operation. Instead of ranking databases every time someone visits the Discover page, we run a background job every night to update the ranking for each database on DoltHub. This process calculates a composite score based on the factors mentioned above. Databases are constantly growing and changing through user contributions, and interesting new databases emerge on DoltHub every day. Our nightly updates ensure the ranking reflects these continuous changes and contributions.
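To make the "compute once, serve many times" idea concrete, here is a minimal sketch in Go. It is not DoltHub's actual job: the `rankCache` type, its methods, and the sample scores are all hypothetical, and a real service would also need scheduling, persistence, and locking around the cached slice.

```go
package main

import (
	"fmt"
	"sort"
)

// rankedDB pairs a database name with its precomputed score.
// Hypothetical type, used only for this illustration.
type rankedDB struct {
	Name  string
	Score float64
}

// rankCache holds the result of the most recent batch run, sorted by score.
type rankCache struct {
	ranked []rankedDB
}

// Refresh stands in for the nightly job: it calls the (expensive) scoreFn
// once per database and stores the results sorted from highest to lowest.
func (c *rankCache) Refresh(names []string, scoreFn func(string) float64) {
	ranked := make([]rankedDB, 0, len(names))
	for _, n := range names {
		ranked = append(ranked, rankedDB{Name: n, Score: scoreFn(n)})
	}
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	c.ranked = ranked
}

// Top serves a page view from the cached ranking without recomputing scores.
func (c *rankCache) Top(n int) []rankedDB {
	if n < 0 {
		n = 0
	}
	if n > len(c.ranked) {
		n = len(c.ranked)
	}
	return c.ranked[:n]
}

func main() {
	// Made-up scores standing in for the real composite score.
	scores := map[string]float64{"alpha": 12, "beta": 40.5, "gamma": 7}

	cache := &rankCache{}
	cache.Refresh([]string{"alpha", "beta", "gamma"}, func(name string) float64 {
		return scores[name]
	})

	for _, db := range cache.Top(2) {
		fmt.Println(db.Name, db.Score)
	}
}
```

The trade-off is freshness: a database that changes during the day keeps its previous score until the next run.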

How the Score Is Calculated

To give you an idea of how we calculate each database's score, let's break down the formula:

weight = Forks Count * Forks Multiplier + Stars Count * Stars Multiplier + Pull Requests Count + Jobs Count + Collaborators Count + Write Recency Score + Size Score

We value popularity, which is why forks and stars have multipliers in the score.
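The formula can be sketched as a small Go function. The `DatabaseStats` struct and its field names are hypothetical, and the two multipliers are placeholders: the actual values are not given in this post.

```go
package main

import "fmt"

// DatabaseStats holds the inputs to the formula. The struct and field names
// are assumptions that mirror the terms of the formula one-to-one.
type DatabaseStats struct {
	Forks             int
	Stars             int
	PullRequests      int
	Jobs              int
	Collaborators     int
	WriteRecencyScore float64
	SizeScore         float64
}

// Placeholder multipliers, chosen only for illustration.
const (
	forksMultiplier = 2.0
	starsMultiplier = 2.0
)

// Weight sums the terms of the formula: forks and stars are scaled by their
// multipliers, and every other term is added as-is.
func Weight(s DatabaseStats) float64 {
	return float64(s.Forks)*forksMultiplier +
		float64(s.Stars)*starsMultiplier +
		float64(s.PullRequests) +
		float64(s.Jobs) +
		float64(s.Collaborators) +
		s.WriteRecencyScore +
		s.SizeScore
}

func main() {
	// Made-up numbers for a single database.
	db := DatabaseStats{
		Forks:             3,
		Stars:             10,
		PullRequests:      4,
		Jobs:              1,
		Collaborators:     2,
		WriteRecencyScore: 120,
		SizeScore:         5.5,
	}
	fmt.Println(Weight(db)) // 6 + 20 + 4 + 1 + 2 + 120 + 5.5 = 158.5
}
```

With these made-up inputs, the recency term contributes far more than the counts, which shows how much the relative scale of each term matters in an additive score.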

Exploring the best methods for calculating write recency and size scores has been a fun journey. Below, we go into more depth on how we arrived at these scores.

Write Recency Score

We store the last write timestamp of each database, which is used to sort databases by recency. To translate it into a recency score, we calculate the number of hours between a cutoff ten days in the past and the last write timestamp. A database written to just now scores about 240 (ten days' worth of hours), and the score falls toward 0 as the last write approaches the ten-day mark. If the last update occurred over ten days ago, the recency score defaults to 0. We tested thresholds of 30 and 15 days before settling on 10 days, which gave a more balanced result.

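    // Start of the ten-day window that earns a non-zero recency score.
    tenDaysAgo := time.Now().Add(-10 * 24 * time.Hour)

    // Hours between the window start and the last write. A negative value
    // means the last write is more than ten days old.
    writeRecencyScore := LastWriteTimestamp.Sub(tenDaysAgo).Hours()
    if writeRecencyScore < 0 {
        writeRecencyScore = 0
    }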

This approach gives recently updated databases a higher score. Some databases, however, remain valuable and frequently accessed despite not being updated for an extended period. To avoid penalizing them for inactivity, the recency score is floored at zero for any database that hasn't been updated in the last 10 days, rather than being allowed to go negative. This recognizes both the importance of fresh content and the value of established databases.
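As a worked example, the same calculation can be wrapped in a function that takes the current time as a parameter, which makes the behavior easy to check. This is a sketch: the function signature, `exampleNow`, and `daysAgo` are assumptions introduced here for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// recencyWindow is the ten-day window described above.
const recencyWindow = 10 * 24 * time.Hour

// exampleNow is a fixed, made-up "current time" so the output is reproducible.
var exampleNow = time.Date(2024, time.March, 25, 12, 0, 0, 0, time.UTC)

// daysAgo returns the instant n days before exampleNow.
func daysAgo(n int) time.Time {
	return exampleNow.Add(-time.Duration(n) * 24 * time.Hour)
}

// writeRecencyScore returns the hours between the start of the window
// (now minus ten days) and lastWrite, floored at 0 for older writes.
func writeRecencyScore(now, lastWrite time.Time) float64 {
	score := lastWrite.Sub(now.Add(-recencyWindow)).Hours()
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	for _, d := range []int{0, 3, 9, 10, 30} {
		score := writeRecencyScore(exampleNow, daysAgo(d))
		fmt.Printf("last write %2d days ago -> score %.0f\n", d, score)
	}
}
```

Last writes 0, 3, 9, 10, and 30 days ago score 240, 168, 24, 0, and 0 respectively, so the score drops by 24 for each day of inactivity until it reaches the floor.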

Size Score

The ideal way to evaluate the scale of a database would be to use the total number of cells instead of the size in bytes. However, storing and updating cell counts is challenging, so we use the size of the database as a proxy.

Initially, we calculated the score by converting the size to MB. However, the sizes of databases on DoltHub range from a few KB to several TB, and this method disproportionately favored very large databases. To address this, we adjusted the size score to apply a logarithmic weighting, which becomes negative for sizes below a threshold of 1MB.

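    import "math"

    // logarithmicWeight maps a size in bytes to a signed, log-scaled score:
    // 0 at exactly 1MB, negative below it, and positive above it.
    func logarithmicWeight(size int64) float64 {
        const threshold = 1024 * 1024 // 1MB
        if size == threshold {
            return 0
        }
        sign := float64(1)
        if size < threshold {
            sign = -1
        }

        // The +1 keeps the logarithm non-negative, since log(1) = 0.
        return sign * math.Log(math.Abs(float64(size-threshold))+1)
    }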

This approach balances the impact of database size on the overall score, ensuring that both large and small databases are fairly ranked based on their actual use and popularity, not just their size.
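To see how much the logarithm compresses the range, the sketch below runs the same function over a few sample sizes. The sample sizes are made up for illustration, and the threshold is moved to package level only so that `main` can reuse it.

```go
package main

import (
	"fmt"
	"math"
)

const threshold = 1024 * 1024 // 1MB, as in the function above

// logarithmicWeight is the same calculation as above: the natural log of the
// distance from the threshold (plus one), negated below the threshold.
func logarithmicWeight(size int64) float64 {
	if size == threshold {
		return 0
	}
	sign := float64(1)
	if size < threshold {
		sign = -1
	}
	return sign * math.Log(math.Abs(float64(size-threshold))+1)
}

func main() {
	samples := []struct {
		label string
		size  int64
	}{
		{"1KB", 1 << 10},
		{"1MB", threshold},
		{"1GB", 1 << 30},
		{"1TB", 1 << 40},
	}
	for _, s := range samples {
		fmt.Printf("%-4s -> %.2f\n", s.label, logarithmicWeight(s.size))
	}
}
```

A 1TB database is about a million times larger than a 1MB one, yet under this weighting its score is only around 28, with 1GB at roughly 21 and 1KB at roughly -14. The size term therefore stays on a scale comparable to the other terms in the formula.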

Conclusion

Our goal with this ranking system is to foster a collaborative data community on DoltHub, where databases that contribute significantly to the community are featured. By shedding light on the metrics of our ranking system, we hope to encourage users to optimize their databases for better visibility on DoltHub. If you have an interesting use case, file a feature request or reach us on our Discord!
