Taking a closer look at the Consistent Hashing algorithm

Jonathan Yue, PhD
5 min readJun 8, 2023

--

Consistent hashing is a widely employed technique in distributed systems, where a hash function is used to map data items and nodes onto a ring-shaped continuum. Each node is allocated a range of hash values on the ring, and data items are assigned to the node whose hash value range encompasses the hash value of the data item. When a new node is added or an existing node is removed, only the affected nodes’ ranges need adjustment. This strategy reduces data movement, enhancing system efficiency and scalability. However, the challenge of data migration has not been fully resolved, and subsequent scaling processes may still require significant data relocation, leading to disruptions and potential performance bottlenecks.

The process of consistent hashing can be described using the following mathematical expressions:

  • Hash Function (H): A hash function H maps a key to a value within the range [0, 1].
  • Node Positions: Each node in the distributed system is represented by a point on a ring, also known as a continuum, with a circumference of 1. The position of a node on the ring is determined by a hash function R, which maps the node’s identifier to a value in the range [0, 1]. The node is responsible for all keys that map to a point on the ring between its own position and the position of its successor node in clockwise order.
  • Key Mapping: To determine which node is responsible for a given key, we apply the hash function H to the key, resulting in a value within the range [0, 1]. We then locate the position of the node on the ring that immediately follows this value in clockwise order, and that node becomes responsible for the key.
  • Adding or Removing Nodes: When a new node is added to the system, its position on the ring is determined by the hash function R. Keys that were previously assigned to the node’s successor are reassigned to the new node. Conversely, when a node is removed from the system, keys that were previously assigned to that node are reassigned to its successor.
  • Virtual Nodes: To improve load balancing, virtual nodes can be created from a single node. These virtual nodes are positioned on the ring similar to the original node. If a key is mapped to a virtual node on the ring, it will eventually be stored on the original node. The number of virtual nodes assigned to a node acts as a weight in load balancing.

Rendezvous hashing, a technique related to consistent hashing, enables data replication with multiple copies (r copies) among a set of N nodes. Unlike consistent hashing, rendezvous hashing does not require precomputing or storing tokens. However, it necessitates re-computing N hash values for every read and write operation, which can become costly as N grows larger. Consistent hashing is equivalent to rendezvous hashing when the number of sites is equal to one. Both consistent hashing and rendezvous hashing exhibit similar patterns of data migration volume during scaling processes. In both approaches, adding nodes to the system requires redistributing data across the nodes.

In a distributed system that employs consistent hashing, a significant portion of data records generated and stored in the system undergo migration from one node to another. This migration is a consequence of a series of incremental scaling processes that the system undergoes, where nodes are added or removed. Due to the inherent randomness and distribution of the hashing function, data records may be reassigned to different nodes, resulting in their migration throughout the system. This dynamic nature of consistent hashing ensures load balancing and efficient utilization of resources, albeit at the cost of frequent data movement. This phenomenon can be mathematically proven and analyzed within the context of consistent hashing.

Fraction of data undergoing redistribution in consistent hashing

In the above equation, N is the total number of incremental scaling processes, m is the total number of nodes introduced each time the system is scaled, Z₀ is the initial number of nodes in the system, p(N) is the decimal fraction of all data that will undergo migration in consistent hashing. As N gets large, p(N) approaches 1. This observation signifies that as the system undergoes continuous scaling out, eventually, every piece of data will be moved to other nodes.

The extensive and frequent data migration inherent in consistent hashing introduces overhead and comes with various costs for the system. These costs include heightened power consumption, increased wear and tear on hardware components, and degraded system performance.

The continuous movement of data across nodes necessitates network traffic, data transfer, and disk I/O operations, leading to increased power consumption and utilization of system resources. The constant shuffling of data also puts additional strain on the hardware, potentially leading to faster deterioration and increased maintenance requirements.

Furthermore, the process of data migration itself introduces delays and interruptions in system operations. The time required to redistribute data among nodes can impact the overall system performance, especially when dealing with large-scale systems or handling real-time workloads. These performance degradations can manifest as increased latency, reduced throughput, and potential bottlenecks in the system.

In response to the challenge of data migration during the scaling of distributed systems, we introduce a new framework called ZeroMove hashing. This innovative approach aims to eliminate the need for any data to be moved during the scaling process. By leveraging the power of randomness and intelligent data placement, the ZeroMove hashing framework minimizes or eliminates the complexities and time-consuming tasks associated with data relocation.

With ZeroMove hashing, data is intelligently assigned to nodes in clusters, thereby reducing or eliminating the need for extensive data movement. This eliminates the risks of downtime and other issues that can arise if data migration is not managed carefully.

The benefits of ZeroMove hashing are significant. Firstly, it simplifies the scaling operations by reducing the complexity and time required for data migration. This ensures a smoother scaling process with minimal disruption to system operation. Additionally, by eliminating data movement, it mitigates the potential performance degradation that may occur during scaling, resulting in improved system efficiency and reduced resource utilization.

The ZeroMove hashing framework provides an effective solution for managing scaling in distributed systems, offering optimized resource utilization and enhanced overall system performance. By incorporating intelligent data placement, it introduces a novel approach that eliminates the challenges associated with data migration during scaling processes. In the upcoming article, I will delve into the details of the ZeroMove hashing technique.

--

--

Jonathan Yue, PhD

Enthusiast on vector databases, AI, RAG, data science, consensus algorithms, distributed systems. Initiator and developer of the JaguarDB vector database