Guide to JaguarDB Vector Database API
Vector databases find application in NLP tasks such as document similarity, sentiment analysis, and semantic search. Textual data can be transformed into high-dimensional vector representations, enabling efficient indexing and retrieval.
Document Similarity and Clustering: Vector databases can be used to measure the similarity between documents. Textual data is transformed into vector representations, such as word embeddings or document embeddings. By comparing the vectors of different documents, the database can identify similar content, cluster related documents, and enable efficient document search.
Sentiment Analysis: Vector databases can aid in sentiment analysis, where the sentiment or emotion expressed in text is determined. Textual data is transformed into vectors, and sentiment analysis algorithms can be applied to analyze the sentiment associated with different vectors. This can be useful in social media monitoring, customer feedback analysis, or brand reputation management.
Semantic Search: Vector databases enable semantic search capabilities, allowing users to find documents or passages related to specific concepts rather than just keyword matches. By representing text as vectors, the database can perform similarity calculations and retrieve documents with similar semantic meaning, even if the wording differs.
JaguarDB offers a set of application programming interfaces (APIs) that can be used directly in the jql.bin client terminal or integrated into programs written in languages such as Java, Python, Go, and Node.js. Developers can therefore interact with JaguarDB from whichever environment they prefer.
Creating a Table for Vectors
create table TABLE (
key: …KEY…,
value: VECCOL vector(dimension,'DISTANCE_INPUT_QUANTIZATION'),
…other_fields…
)
Here, “VECCOL” is the name of the vector column and “dimension” is the number of components in each vector; common dimensions include 768, 1024, and 1536. The string “DISTANCE_INPUT_QUANTIZATION” specifies the distance type, the input data type, and the quantization level used when storing vectors and searching for similar ones. The supported distance types are:
Euclidean Distance
The Euclidean distance, also known as the L2 distance or the Euclidean norm, is a measure of the straight-line distance between two points in a multi-dimensional space. It’s commonly used to quantify the similarity between vectors.
Cosine Distance
Cosine distance is a measure used to quantify the dissimilarity between two vectors in a multi-dimensional space. Unlike the Euclidean distance that measures the direct geometric distance between vectors, the cosine distance focuses on the angle between the vectors.
InnerProduct
Inner product similarity is useful for similarity search in scenarios where the magnitudes of vectors are important in addition to their directions.
Manhattan Distance
Manhattan distance is a distance metric between two points in a multi-dimensional vector space. It is the sum of the absolute differences between the coordinates of the two points across all dimensions.
Chebyshev Distance
Chebyshev distance is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension.
Hamming Distance
The Hamming distance between two vectors is the number of positions at which the corresponding components are different.
Jaccard Distance
The Jaccard distance between two vectors is based on the ratio of intersection over union of the two vectors (the Jaccard index); conventionally, the distance is one minus that ratio.
Minkowski Half
In JaguarDB, Minkowski Half distance refers to the Minkowski distance where p = 0.5.
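To make the definitions above concrete, here is a minimal pure-Python sketch of each distance using its common textbook form. It is an illustration only: the function names are hypothetical, cosine distance is taken as one minus cosine similarity, and Jaccard distance is computed on binary vectors (nonzero means "present"). The article does not state the exact formulas JaguarDB applies internally.

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Assumed convention: 1 - cosine similarity (depends only on the angle)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def inner_product(a, b):
    # Sensitive to vector magnitudes as well as directions
    return sum(x * y for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the components differ
    return sum(1 for x, y in zip(a, b) if x != y)

def jaccard_distance(a, b):
    # Assumption: binary vectors; distance = 1 - |intersection| / |union|
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - inter / union if union else 0.0

def minkowski(a, b, p):
    # Minkowski distance of order p; p = 0.5 gives the "Minkowski Half" case
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)
```

For example, `minkowski(a, b, 0.5)` corresponds to the Minkowski Half definition, while `minkowski(a, b, 2)` agrees with `euclidean(a, b)`.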
The input type in JaguarDB refers to the expected data format in the input vectors. There are two input types: fraction and whole. JaguarDB excels not only in managing vector embeddings but also in handling a diverse range of feature vectors. These vectors can include various types and forms, whether they are normalized or unnormalized, presented in fractional or full original formats. This versatility underscores JaguarDB’s capability to accommodate a wide array of data formats.
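The vector definition string can be assembled and checked programmatically. The sketch below assumes the three parts are joined with underscores, which is inferred from type strings that appear in this article such as 'cosine_fraction_short'; the helper names are hypothetical and are not part of JaguarDB's API. Only the input type is validated, because the article names exactly two input types and does not enumerate the quantization levels.

```python
INPUT_TYPES = {"fraction", "whole"}  # the two input types named in the text

def make_vector_type(distance, input_type, quantization):
    """Join the three parts into a DISTANCE_INPUT_QUANTIZATION string."""
    if input_type not in INPUT_TYPES:
        raise ValueError("input type must be 'fraction' or 'whole'")
    return f"{distance}_{input_type}_{quantization}"

def split_vector_type(type_string):
    """Split a definition string back into (distance, input, quantization)."""
    # Split from the right so a distance name could itself contain underscores
    parts = type_string.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError("expected DISTANCE_INPUT_QUANTIZATION")
    distance, input_type, quantization = parts
    if input_type not in INPUT_TYPES:
        raise ValueError("input type must be 'fraction' or 'whole'")
    return distance, input_type, quantization
```

Validating the string on the client side catches a misspelled input type before the statement is sent to the server.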
Adding Vectors
JaguarDB can integrate all application and vector data, facilitating streamlined data management for real-world scenarios. It enables the incorporation of vector data alongside other pertinent information related to business objects, allowing for comprehensive and cohesive data representation.
insert into TABLE ( …, VECCOL, …) values (…, 'VECTOR_STRING', … )
insert into TABLE values (…, 'VECTOR_STRING', … )
Where VECTOR_STRING is a list of comma-separated components of the vector. In the second statement, the values must be provided according to the correct order of the columns in the table. Once the vector is added, the value of the field for VECCOL will be replaced with an integer as the unique identifier for the vector. With a vector ID, the components of the vector can be retrieved from the vector database.
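As a rough illustration of the first insert form, the following sketch formats a Python list as a VECTOR_STRING and builds the statement text. The helper names, the table name vec1, and the column names are hypothetical; the sketch also assumes that non-vector values are passed as single-quoted strings, which the article does not confirm for every column type.

```python
def vector_string(components, dimension):
    """Format vector components as the comma-separated VECTOR_STRING."""
    if len(components) != dimension:
        raise ValueError(
            f"expected {dimension} components, got {len(components)}")
    return ",".join(repr(float(x)) for x in components)

def build_insert(table, vec_col, components, dimension, other=None):
    """Build an insert statement naming the columns explicitly.

    `other` maps additional column names to values (assumed to be
    string-like and free of single quotes).
    """
    other = other or {}
    columns = list(other) + [vec_col]
    values = [str(v) for v in other.values()]
    values.append(vector_string(components, dimension))
    col_sql = ", ".join(columns)
    val_sql = ", ".join("'" + v + "'" for v in values)
    return f"insert into {table} ({col_sql}) values ({val_sql})"

# Hypothetical table and columns, for illustration only
stmt = build_insert("vec1", "v", [0.1, 0.2, 0.3], 3, {"source": "demo"})
print(stmt)
```

Checking the component count against the declared dimension before building the statement avoids sending a vector that does not match the column definition.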
Similarity Search
Similarity search using JaguarDB vectors involves the process of finding vectors within the database that are most similar to a given query vector. This search is conducted based on predefined similarity metrics, such as cosine similarity or Euclidean distance similarity, which quantify the resemblance between vectors. The API for similarity search is as follows:
select
similarity(v, 'QUERY_VECTOR',
'topk=K,type=DISTANCE_INPUT_QUANTIZATION')
from TABLE;
Here, QUERY_VECTOR is a list of comma-separated component values, and “K” is the number of most similar vectors to find and return for the query vector. The result is returned in JSON format; the developer can call the jsonString() function to parse it and retrieve the ID and distance values.
As an example, the following statement returns the top 5 most similar vectors to the query vector:
select similarity(v, '0.1, 0.2, 0.3, 0.4, 0.5, 0.3, 0.1',
'topk=5,type=manhatten_fraction_byte') from vec1;
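To clarify what a top-K query asks for, here is a small in-memory sketch that scans a handful of vectors exhaustively and keeps the K nearest by Manhattan distance. It only illustrates the semantics of "the K most similar vectors"; it says nothing about how JaguarDB indexes or searches internally, and the function names and sample data are made up for the example.

```python
import heapq

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def top_k(query, vectors, k, distance=manhattan):
    """Return the k (vector_id, distance) pairs closest to query, nearest first."""
    scored = ((distance(query, vec), vid) for vid, vec in vectors.items())
    nearest = heapq.nsmallest(k, scored)
    return [(vid, dist) for dist, vid in nearest]

# Hypothetical stored vectors keyed by an integer vector ID
stored = {
    1: [1, 2, 3],
    2: [9, 8, 7],
    3: [1, 2, 5],
}
results = top_k([1, 2, 3], stored, k=2)
for rank, (vid, dist) in enumerate(results, start=1):
    print(f"rank={rank} vectorid={vid} distance={dist}")
```

Each result carries an ID and a distance, which mirrors the kind of information the similarity query returns.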
Combining Vector Search and Exact Search
JaguarDB supports combining similarity search with exact predicate search in a single query. The similarity search assigns a similarity value to each retrieved record, while the where clause restricts the results to records that match specific criteria. In the Python illustration in this section, the query finds text similar to a given input text and the criterion is the source of the text; in practice, any number of predicates can be applied.
select
similarity(v, 'QUERY_VECTOR',
'topk=K,type=DISTANCE_INPUT_QUANTIZATION')
from TABLE
where ...;
An example that integrates similarity search with predicate-based search is shown below:
def retrieveTopKWithCriteria( jag, model, queryText, src, K ):
    print("Query: " + queryText )

    # Embed the query text and format it as a comma-separated vector string
    sentences = [ queryText ]
    embeddings = model.encode(sentences, normalize_embeddings=False)
    comma_str = ",".join( [str(x) for x in embeddings[0] ])

    # Similarity search restricted to records from the given source
    qstr = "select similarity(v, '" + comma_str
    qstr += "', 'topk=" + str(K) + ",type=cosine_fraction_short')"
    qstr += " from textvec"
    qstr += " where source='" + src + "'"

    jag.query( qstr )
    print("\n")
    print("Result: ")

    # Print the fields of each returned record
    while jag.reply():
        print('zid={}'.format(jag.getValue("zid")) )
        print('v={}'.format(jag.getValue("v")) )
        print('vectorid={}'.format(jag.getValue("vectorid")) )
        print('rank={}'.format(jag.getValue("rank")) )
        print('distance={}'.format(jag.getValue("distance")) )
        print('source={}'.format(jag.getValue("source")) )
        print('text={}'.format(jag.getValue("text")) )
        print("\n")
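Because the example builds the where clause by string concatenation, a stray single quote in the predicate value would change the meaning of the query. The sketch below is one cautious way to guard against that: it rejects such values outright rather than trying to escape them, since the article does not describe JaguarDB's quoting rules. The helper names are hypothetical.

```python
def safe_literal(value):
    """Return value wrapped in single quotes, rejecting embedded quotes."""
    text = str(value)
    if "'" in text:
        raise ValueError("single quotes are not allowed in predicate values")
    return "'" + text + "'"

def source_predicate(src):
    """Build the where clause used to filter records by their source."""
    return " where source=" + safe_literal(src)

# Usage: append the checked predicate to the similarity query text
print(source_predicate("wiki"))
```

Rejecting suspicious input is deliberately conservative; if the client library offers parameter binding, that would normally be the better choice.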
Retrieving Vectors
In cases where users need to retrieve the component values of a vector, the following API can be used:
select
vector(VECCOL, 'type=DISTANCE_INPUT_QUANTIZATION')
from TABLE
where KEY=…
For example,
select vector(v, 'type=manhatten_fraction_short')
from vec1
where fid='ANjf84HBCWos88223@001'
The KEY used in the query must uniquely identify the record containing the vector; typically this is the ZeroMove unique ID.
Updating Vectors
The vector components can be updated in two ways, either by record key or by vector ID:
update TABLE
set VECCOL:vector='VECTOR_STRING'
where KEY=…
update TABLE
set VECCOL:vector='VECTOR_ID:VECTOR_STRING'
where 1
where VECTOR_ID is the integer value of the vector ID, and VECTOR_STRING is a list of comma-separated component values.
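The two update forms differ only in how the target vector is identified, which the following sketch makes explicit by building each statement as text. The function names, the table name vec1, and the key clause are hypothetical placeholders; the statement shapes follow the two templates above.

```python
def _vector_string(components):
    # Comma-separated component values, as in VECTOR_STRING
    return ",".join(str(x) for x in components)

def update_by_key(table, vec_col, key_clause, components):
    """First form: the where clause selects the record by its key."""
    vec = _vector_string(components)
    return f"update {table} set {vec_col}:vector='{vec}' where {key_clause}"

def update_by_vector_id(table, vec_col, vector_id, components):
    """Second form: the vector ID is prefixed to the vector string."""
    vec = _vector_string(components)
    return (f"update {table} set {vec_col}:vector="
            f"'{int(vector_id)}:{vec}' where 1")

# Hypothetical table, column, and key value, for illustration only
print(update_by_key("vec1", "v", "zid='abc'", [0.1, 0.2]))
print(update_by_vector_id("vec1", "v", 12, [0.1, 0.2]))
```

Converting the vector ID with `int()` ensures that only an integer ends up in front of the colon, matching the description of VECTOR_ID.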
Deleting Vectors
The vector components cannot be deleted separately without deleting the record containing the vector. A table record can be deleted with the following command:
delete from TABLE
where KEY=…
The KEY in the above statement must uniquely identify a record housing the vector, typically the ZeroMove unique ID. In addition, dropping or truncating a table will delete the associated vectors as well.
Conclusion
In this article, we presented a guide to programming with JaguarDB’s API for vector management. It covered the essential operations: creating tables with vector columns, adding vectors, running similarity searches, combining similarity search with predicate search, and retrieving, updating, and deleting vectors.