Anomaly Detection in JaguarDB

Jonathan Yue, PhD
Dec 21, 2023

The Jaguar vector database gives data scientists a new way to approach anomaly detection. It provides a structured and efficient means of storing and querying data, so organizations can analyze patterns and deviations in their vectors. This streamlines anomaly detection and helps identify potential threats more accurately. As business becomes increasingly digital, using vector databases for anomaly detection has become a strategic priority for enterprises seeking to safeguard their operations and data from malicious activity.

The API for analyzing anomalies is shown below:

select
anomaly(VECCOL,
'type=DISTANCE_INPUT_QUANTIZATION,slices=N')
from pod.store

where type specifies the distance type, input type, and quantization level of the vectors (for example, euclidean_whole_float), and slices is the number of slices into which the 4-standard-deviation span of the distribution of all the vectors in the vector store is divided. The default value of slices is 20. Some examples of anomaly detection queries follow:

select
anomaly(vc,
'type=euclidean_whole_float')
from myvector;

select
anomaly(vc,
'type=euclidean_whole_float, slices=30')
from myvector;

The result is a list of JSON objects that looks like this:

[{"sigma":"0.1","prate":"0.9"},{"sigma":"0.3","prate":"0.9"},{"sigma":"0.5","prate":"0.7"},{"sigma":"0.7","prate":"0.7"},{"sigma":"0.9","prate":"0.7"},{"sigma":"1.1","prate":"0.6"},{"sigma":"1.3","prate":"0.6"},{"sigma":"1.5","prate":"0.6"},{"sigma":"1.7","prate":"0.6"},{"sigma":"1.9","prate":"0.6"},{"sigma":"2.1","prate":"0.5"},{"sigma":"2.3","prate":"0.4"},{"sigma":"2.5","prate":"0.4"},{"sigma":"2.7","prate":"0.4"},{"sigma":"2.9","prate":"0.4"},{"sigma":"3.1","prate":"0.4"},{"sigma":"3.3","prate":"0.4"},{"sigma":"3.5","prate":"0.4"},{"sigma":"3.7","prate":"0.2"},{"sigma":"3.9","prate":"0.2"}]
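The sample output can be loaded into numeric pairs for further analysis. The sketch below is illustrative only: the helper names (parse_anomaly_result, slice_midpoints, first_sigma_below) are hypothetical and not part of any JaguarDB client. It also checks an observation about the sample, namely that the reported sigma values coincide with the midpoints of 20 equal slices over a 4-sigma span. That reading fits this output but is an assumption, not documented behavior.

```python
import json

# Result string copied verbatim from the sample output above.
raw = (
    '[{"sigma":"0.1","prate":"0.9"},{"sigma":"0.3","prate":"0.9"},'
    '{"sigma":"0.5","prate":"0.7"},{"sigma":"0.7","prate":"0.7"},'
    '{"sigma":"0.9","prate":"0.7"},{"sigma":"1.1","prate":"0.6"},'
    '{"sigma":"1.3","prate":"0.6"},{"sigma":"1.5","prate":"0.6"},'
    '{"sigma":"1.7","prate":"0.6"},{"sigma":"1.9","prate":"0.6"},'
    '{"sigma":"2.1","prate":"0.5"},{"sigma":"2.3","prate":"0.4"},'
    '{"sigma":"2.5","prate":"0.4"},{"sigma":"2.7","prate":"0.4"},'
    '{"sigma":"2.9","prate":"0.4"},{"sigma":"3.1","prate":"0.4"},'
    '{"sigma":"3.3","prate":"0.4"},{"sigma":"3.5","prate":"0.4"},'
    '{"sigma":"3.7","prate":"0.2"},{"sigma":"3.9","prate":"0.2"}]'
)


def parse_anomaly_result(text):
    """Convert the JSON list into (sigma, prate) float pairs."""
    return [(float(o["sigma"]), float(o["prate"])) for o in json.loads(text)]


def slice_midpoints(slices=20, span=4.0):
    """Midpoints of `slices` equal-width bins covering [0, span] sigma.

    Assumption: the reported sigma values are bin midpoints.
    """
    width = span / slices
    return [(i + 0.5) * width for i in range(slices)]


def first_sigma_below(pairs, prate_limit):
    """Smallest reported sigma whose prate is strictly below prate_limit."""
    for sigma, prate in pairs:
        if prate < prate_limit:
            return sigma
    return None


pairs = parse_anomaly_result(raw)
sigmas = [s for s, _ in pairs]
matches = all(abs(a - b) < 1e-9 for a, b in zip(sigmas, slice_midpoints(len(pairs))))

print(len(pairs))                     # 20
print(matches)                        # True
print(first_sigma_below(pairs, 0.5))  # 2.3
```

A lookup such as first_sigma_below is one simple way to read candidate thresholds off the distribution before choosing activation values.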

Graphically, the result can be viewed as follows:

Histogram of percentage of vector components and sigma values

The histogram displays the proportion of vector components that deviate by more than a given multiple of the standard deviation (sigma) within their distribution. For instance, at a sigma value of 1.7, approximately 55% of the vector components are observed to fall more than 1.7 standard deviations away from the mean of their respective vector components. With such insights, one can evaluate how anomalous vectors are and determine the threshold values used to classify vectors as anomalous, as described below.
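To make the quantity in the histogram concrete, here is a minimal sketch of how such a proportion could be computed outside the database. It is not JaguarDB's implementation: the function names are hypothetical, the statistics are computed per dimension across a tiny in-memory store, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def component_stats(vectors):
    """Per-dimension mean and population standard deviation across the store."""
    columns = list(zip(*vectors))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def fraction_beyond(vector, means, stdevs, k):
    """Fraction of components with |value - mean| greater than k * stdev."""
    hits = sum(
        1
        for v, m, s in zip(vector, means, stdevs)
        if s > 0 and abs(v - m) > k * s
    )
    return hits / len(vector)


# Toy store: every dimension has mean 1.0 and standard deviation 1.0.
store = [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
means, stdevs = component_stats(store)

# Component deviations from the mean are 0.5, 2.0, 0.0 and 3.0 sigma.
query = [1.5, 3.0, 1.0, -2.0]
for k in (0.3, 1.0, 2.5):
    print(k, fraction_beyond(query, means, stdevs, k))
# 0.3 0.75
# 1.0 0.5
# 2.5 0.25
```

Evaluating fraction_beyond at each slice midpoint would give one prate-style value per sigma, which is the shape of the result shown earlier.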

Often a vector is checked to see whether it is anomalous relative to the main body of the dataset. The API for detecting anomalousness is shown below:

select
anomalous(VECCOL,
'type=DISTANCE_INPUT_QUANTIZATION,activation=[sigma1:percent1&sigma2:percent2&sigma3:percent3&…]')
from pod.store

where type specifies the distance type, input type, and quantization level of the vectors; the parameter activation specifies one or more “sigma:percent” pairs stating what percentage of vector components must exceed each sigma value for the vector to be classified as anomalous. For instance, if activation is “0.3:70&1:50&1.5:30&3:10”, then more than 70 percent of the vector components must exceed 0.3 times the standard deviation (sigma); more than 50 percent must exceed one sigma; more than 30 percent must exceed 1.5 sigma; and finally more than 10 percent must exceed three sigma. If any condition is not met, the query vector is not classified as anomalous.

select
anomalous(vc,
'type=euclidean_whole_float,activation=[0.3:60&1:50&1.5:30&3:10]')
from pod.mystore;

The result is a JSON string indicating whether “anomalous” is YES or NO.

{"anomalous":"YES","percent":"74&60&40&12","activation":"0.3:60&1:50&1.5:30&3:10"}
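The activation rule can be replayed against this sample result. The sketch below is illustrative rather than JaguarDB code: parse_activation and is_anomalous are hypothetical helpers, and it assumes that the "percent" field lists the observed percentages in the same order as the activation pairs.

```python
def parse_activation(activation):
    """Parse 'sigma:percent' pairs joined by '&' into (sigma, percent) floats."""
    pairs = []
    for item in activation.strip("[]").split("&"):
        sigma, percent = item.split(":")
        pairs.append((float(sigma), float(percent)))
    return pairs


def is_anomalous(observed_percent, activation):
    """True only if every observed percentage exceeds its activation threshold."""
    thresholds = parse_activation(activation)
    observed = [float(p) for p in observed_percent.split("&")]
    if len(observed) != len(thresholds):
        raise ValueError("percent and activation must have the same number of entries")
    return all(o > t for o, (_, t) in zip(observed, thresholds))


# Sample result copied from the article.
result = {
    "anomalous": "YES",
    "percent": "74&60&40&12",
    "activation": "0.3:60&1:50&1.5:30&3:10",
}

verdict = "YES" if is_anomalous(result["percent"], result["activation"]) else "NO"
print(verdict)  # YES
```

Every observed percentage (74, 60, 40, 12) exceeds its threshold (60, 50, 30, 10), so all four conditions hold and the verdict agrees with the "anomalous" field in the sample.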

The anomaly detection feature in Jaguar Vector Database is exceptionally beneficial, offering critical insights into vector distributions for identifying anomalous data or outliers. This functionality is particularly valuable in various applications such as fraud detection, network attack identification, data cleansing, and preparation of high-quality, noise-filtered data for machine learning. Additionally, it plays a crucial role in monitoring and securing real-time business transactions, making it an indispensable tool in these contexts.

Jonathan Yue, PhD

Enthusiast on vector databases, AI, RAG, data science, consensus algorithms, distributed systems. Initiator and developer of the JaguarDB vector database