By Dave Bergstein, InfoWorld

Solving complex problems with vector databases

Vector databases unlock the insights buried in complex data including documents, videos, images, audio files, workflows, and system-generated alerts. Here’s how.

The world of data is rapidly changing around us, yet many companies are reacting slowly to the trends. Experts predict that by 2025, 80% or more of all data will be unstructured, but a survey by Deloitte suggests that only 18% of organizations are prepared to analyze unstructured data. This means that the vast majority of companies are not able to utilize the better part of the data in their possession, and it all comes down to having the right tools.

A lot of that data is fairly straightforward. Keywords, metrics, strings, and structured objects like JSON are relatively simple. Traditional databases can organize these kinds of data, and many basic search engines can help you search through them. They help you efficiently answer relatively simple questions:

Which documents contain this set of words?
Which items meet these objective filtering criteria?

More complex data are significantly more difficult to interpret, but they are also more interesting and may unlock more value to the business by answering more sophisticated questions like:

What songs are similar to a sample of “liked” songs?
What documents are available on a given subject?
Which security alerts need attention and which can be ignored?
Which items match a natural language description?

Answering questions like these often requires more complex, less structured data including documents, passages of plain text, videos, images, audio files, workflows, and system-generated alerts. These forms of data do not easily fit into traditional SQL-style databases and they may not be discoverable by simple search engines. To organize and search through these kinds of data, we need to convert the data to formats that computers can process.

The power of vectors

Fortunately, machine learning models allow us to create numeric representations of text, audio, images, and other forms of complex data. These numeric representations, or vector embeddings, are designed so that semantically similar items map to nearby representations. Two representations are near or far depending on the angle or distance between them, when viewed as points in high-dimensional space.
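The two notions of nearness mentioned above, angle and distance, can be sketched in a few lines. This is a minimal illustration in plain Python; the three-dimensional vectors and the names `cat`, `kitten`, and `car` are made up for the example (real embeddings come from a model and typically have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Semantically similar items should score higher on similarity
# and lower on distance than unrelated items.
print(cosine_similarity(cat, kitten), cosine_similarity(cat, car))
print(euclidean_distance(cat, kitten), euclidean_distance(cat, car))
```

With these values, `cat` is closer to `kitten` than to `car` under both measures.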

Machine learning models allow us to interact with machines more similarly to how we interact with humans. For text, this means users can ask natural language questions — the query is converted into a vector using the same embedding model that converted all of the search items into vectors. The query vector is then compared to all of the object vectors to find the nearest matches. In the same way, image or audio files can be transformed into vectors that allow us to search for matches based on the nearness (or mathematical similarity) of their vectors.

Today, you can convert your data to vectors more easily than even a few years ago, thanks to several embedding models that perform well and often work as-is. Text embedding models like Word2Vec, GloVe, and BERT are excellent general-purpose vector embedders. Images can be embedded using models such as VGG and Inception. Audio recordings can be transformed into vectors by applying image embedding models to a visual representation of the audio’s frequencies, such as a spectrogram. These models are all well-established and can be fine-tuned for special applications and knowledge domains.

With embedding models readily available, the question shifts from how to convert complex data into vectors to how to organize and search those vectors.

Enter vector databases. Vector databases are specifically designed to work with the unique characteristics of vector embeddings. They index data in a way that makes it easy to search and retrieve objects according to their numerical values.

What is a vector database?

At Pinecone, we define a vector database as a tool that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like metadata filtering and horizontal scaling. Vector embeddings, or vectors, as we mentioned earlier, are numerical representations of data objects. The vector database organizes vectors so that they can be quickly compared to one another or to the vector representation of a search query.

Vector databases are specifically designed for unstructured data and yet provide some of the functionality you’d expect from a traditional relational database. They can execute CRUD operations (create, read, update, and delete) on the vectors they store, provide data persistence, and filter queries by metadata. When you combine vector search with database operations, you get a powerful tool with many applications.

While this technology is still emerging, vector databases already power some of the largest tech platforms in the world. Spotify offers personalized music recommendations based on liked songs, listening history, and similar musical profiles. Amazon uses vectors to recommend products that are complementary to items being browsed. Google’s YouTube keeps viewers streaming on their platform by serving up new relevant content based on similarity to the current video and viewing history. Vector database technology has continued to improve, offering better performance and more personalized user experiences for customers.

Today, the promise of vector databases is within reach for any organization. Open-source projects help organizations that want to build and maintain their own vector database, and managed services help companies that would rather outsource this work and focus their attention elsewhere. In this article, we will explore important features of vector databases and the best ways to use them.

Common applications for vector databases

Similarity search or “vector search” is the most common use case for vector databases. Vector search compares the proximity of multiple vectors in the index to a search query or subject item. In order to find similar matches, you convert the subject item or query into a vector using the same machine learning embedding model used to create your vector embeddings. The vector database compares the proximity of these vectors to find the closest matches, providing relevant search results. Some examples of vector database applications:

Semantic search. You generally have two options when searching text and documents: lexical or semantic search. Lexical search looks for matches of strings of words, exact words, or word parts. Semantic search, on the other hand, uses the meaning of a search query to compare it to candidate objects. Natural language processing (NLP) models convert text and whole documents into vector embeddings. These models seek to represent the context of words and the meaning they convey. Users can then query using natural language and the same model to find relevant results without having to know specific keywords.
Similarity search for audio, video, images, and other types of unstructured data. These data types are hard to characterize well with structured data compatible with traditional databases. An end user may struggle to know how the data was organized or what attributes would help them identify the items. Users can query the database using similar objects and the same machine learning model to more easily compare and find similar matches.
Deduplication and record matching. Consider an application that removes duplicate items from a catalog, making the catalog more usable and relevant. Traditional databases can do this if the duplicate items are organized similarly and register as a match. But this isn’t always the case. A vector database allows one to use a machine learning model to determine similarity, which can often avoid inaccurate or manual classification efforts.
Recommendation and ranking engines. Similar items often make for great recommendations. For example, consumers often find it helpful to see similar or suggested products, content, or services for comparison. It may help a consumer discover a new product he or she wouldn’t have otherwise found or considered.
Anomaly detection. Vector databases can find outliers that are very different from all other objects. One may have a million diverse but expected patterns, whereas an anomaly may be anything sufficiently different from every one of those million expected patterns. Such anomalies can be very valuable for IT operations, security threat assessments, and fraud detection.
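As one illustration of the applications listed above, anomaly detection can be framed as a nearest neighbor question: how far is a new item from the closest expected pattern? The sketch below assumes a small set of known vectors and a distance threshold of 0.5; both are invented for the example, and a real system would tune the threshold and use an index rather than a full scan.

```python
import math

def nearest_distance(candidate, known):
    """Distance from the candidate to its closest known vector."""
    return min(math.dist(candidate, v) for v in known)

def is_anomaly(candidate, known, threshold):
    """Flag the candidate if nothing known lies within the threshold."""
    return nearest_distance(candidate, known) > threshold

# Hypothetical expected patterns, as 2-dimensional vectors.
known_patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(is_anomaly([0.9, 0.1], known_patterns, threshold=0.5))  # close to [1.0, 0.0]
print(is_anomaly([5.0, 5.0], known_patterns, threshold=0.5))  # far from everything
```

The first candidate sits next to an expected pattern and is not flagged; the second is far from all of them and is.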

Key capabilities of vector databases

Vector indexing and similarity search

Vector databases use algorithms specifically designed to index and retrieve vectors efficiently. They use “nearest neighbor” algorithms to assess the proximity of similar objects to one another or a search query. You can compute the distances between a query vector and 100 other vectors fairly easily. Computing the distances for 100M vectors is another story.
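To make the cost concrete, exact nearest neighbor search can be sketched as a full scan. This is a minimal illustration in plain Python, not how any particular vector database implements it; the function name `exact_top_k`, the 8-dimensional random vectors, and the dataset size of 1,000 are assumptions chosen for the example.

```python
import heapq
import math
import random

def exact_top_k(query, vectors, k):
    """Exact search: measure the distance to every stored vector."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)  # list of (distance, index) pairs

random.seed(0)
dim = 8
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]
query = vectors[42]  # search with a stored vector so the best match is known

top = exact_top_k(query, vectors, k=3)
print(top[0])  # the query's own entry, at distance 0.0
```

Every query touches every stored vector, so the work grows linearly with the size of the index. That is manageable for 100 or 1,000 vectors and impractical for 100M.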

Approximate nearest neighbor (ANN) search solves the latency problem by approximating and retrieving the best guess of similar vectors. ANN doesn’t guarantee an exact set of best matches, but it balances very good accuracy with much faster performance. Some of the most widely used techniques for building ANN indexes include hierarchical navigable small world (HNSW) graphs, product quantization (PQ), and the inverted file index (IVF). Most vector databases use a combination of these to produce a composite index optimized for performance.
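The idea behind an inverted file index can be sketched briefly: group vectors into buckets around centroids, then scan only the few buckets nearest the query. The code below is a simplified illustration under stated assumptions, not a production index. In particular, the centroids are a random sample of the data (real IVF indexes typically train them with k-means), and the names `build_ivf`, `ivf_search`, `n_lists`, and `n_probe` are chosen for this example.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def build_ivf(vectors, n_lists, seed=0):
    """Assign every vector to the bucket of its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)  # simplification: no k-means
    lists = [[] for _ in range(n_lists)]
    for i, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Scan only the n_probe buckets whose centroids are nearest the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    scored = sorted((math.dist(query, vectors[i]), i) for i in candidates)
    return scored[:k], len(candidates)

random.seed(1)
vectors = [[random.random() for _ in range(8)] for _ in range(2000)]
centroids, lists = build_ivf(vectors, n_lists=20)

query = vectors[7]
hits, scanned = ivf_search(query, vectors, centroids, lists, k=5, n_probe=3)
print(hits[0], f"scanned {scanned} of {len(vectors)} vectors")
```

Raising `n_probe` trades speed for accuracy: probing every bucket reproduces the exact result, while probing a few buckets scans a fraction of the data and may miss some true neighbors.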

Single-stage filtering

Filtering is a useful technique for limiting search results based on chosen metadata to increase relevance. This is typically done either before or after a nearest neighbor search. Pre-filtering shrinks the dataset first, before the ANN search, but this is typically incompatible with leading ANN algorithms. One workaround is to shrink the dataset first and then perform a brute-force exact search. Post-filtering shrinks the results after the ANN search across the whole dataset. Post-filtering leverages the speed of ANN algorithms, but may not return enough results. Consider a case where the filter down-selects only a small number of candidates that are unlikely to be returned from a search across the whole dataset.
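The post-filtering shortfall described above is easy to reproduce. In this sketch, an exact search stands in for the ANN step, and the catalog, the `in_stock` field, and the function names are all hypothetical, chosen only to show the two orderings side by side.

```python
import math

def top_k(query, items, k):
    """Exact search, used here as a stand-in for the ANN step."""
    return sorted(items, key=lambda item: math.dist(query, item["vector"]))[:k]

def pre_filter_search(query, items, k, predicate):
    """Shrink the dataset first, then search what remains."""
    subset = [item for item in items if predicate(item)]
    return top_k(query, subset, k)

def post_filter_search(query, items, k, predicate):
    """Search the whole dataset first, then drop non-matching results."""
    return [item for item in top_k(query, items, k) if predicate(item)]

def in_stock(item):
    return item["in_stock"]

# Hypothetical catalog: 100 items on a line, only every 25th is in stock.
items = [{"id": i, "vector": [float(i)], "in_stock": i % 25 == 0}
         for i in range(100)]
query = [10.0]

pre = pre_filter_search(query, items, 3, in_stock)
post = post_filter_search(query, items, 3, in_stock)
print([item["id"] for item in pre])   # three in-stock items
print([item["id"] for item in post])  # empty: the nearest three are all out of stock
```

Pre-filtering returns the three nearest in-stock items, while post-filtering returns nothing because none of the three nearest items overall passes the filter.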

Single-stage filtering combines the accuracy and relevance of pre-filtering with speed close to that of post-filtering. It does this by merging the vector and metadata indexes into a single index, offering the best of both approaches.

API

As with many managed services, you and your applications typically interact with the vector database through an API. This allows your organization to focus on its own applications without having to worry about the performance, security, and availability challenges of managing its own vector database.

API calls make it easy for developers and applications to upload data, query, fetch results, or delete data.
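The shape of such an API can be sketched with a toy in-memory class. This is not the API of Pinecone or any other product; the class `ToyVectorStore` and its method names `upsert`, `fetch`, `delete`, and `query` are hypothetical, chosen to mirror the upload, fetch, delete, and query operations mentioned above.

```python
import math

class ToyVectorStore:
    """In-memory stand-in for a vector database client (hypothetical API)."""

    def __init__(self):
        self._records = {}  # item_id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Create the record, or overwrite it if the ID already exists."""
        self._records[item_id] = (list(vector), dict(metadata or {}))

    def fetch(self, item_id):
        """Return (vector, metadata) for the ID, or None if it is absent."""
        return self._records.get(item_id)

    def delete(self, item_id):
        """Remove the record if present; do nothing otherwise."""
        self._records.pop(item_id, None)

    def query(self, vector, top_k=3):
        """Return the IDs of the top_k nearest stored vectors (full scan)."""
        scored = sorted((math.dist(vector, v), item_id)
                        for item_id, (v, _) in self._records.items())
        return [item_id for _, item_id in scored[:top_k]]

store = ToyVectorStore()
store.upsert("a", [0.0, 0.0], {"genre": "jazz"})
store.upsert("b", [1.0, 1.0])
store.upsert("c", [5.0, 5.0])

before = store.query([0.9, 0.9], top_k=2)
store.delete("b")
after = store.query([0.9, 0.9], top_k=2)
print(before, after)
```

A managed service would expose similar operations over the network and replace the full scan in `query` with an ANN index.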

Hybrid storage

Vector databases typically store all of the vector data in memory for fast query and retrieval. But for applications with more than a billion search items, memory costs alone would stall many vector database projects. You could instead opt to store vectors on disk, but this usually comes at the cost of high search latencies.

With hybrid storage, a compressed vector index is stored in memory, and the complete vector index is stored on disk. The in-memory index can narrow the search space to a small set of candidates within the full-resolution index on disk. Hybrid storage allows you to store more vectors across the same data footprint, lowering the cost of operating your vector database by improving overall storage capacity without negatively impacting database performance.
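The two-stage lookup can be sketched as follows. This is a simplified illustration under several assumptions: the compression is a basic 16-level scalar quantization (one byte per dimension) rather than whatever scheme a given database uses, a Python list stands in for the full-precision vectors on disk, and the names `quantize`, `hybrid_search`, and `n_candidates` are invented for the example.

```python
import math
import random

LEVELS = 16  # assumed number of quantization buckets per dimension

def quantize(vector):
    """Lossy compression: one byte per dimension for values in [0, 1)."""
    return bytes(min(int(x * LEVELS), LEVELS - 1) for x in vector)

def hybrid_search(query, codes, full_vectors, k, n_candidates):
    """Narrow with compact in-memory codes, then rerank at full precision."""
    scaled_query = [x * LEVELS for x in query]

    def coarse_distance(i):
        # Compare against bucket centers reconstructed from the compact codes.
        return math.dist(scaled_query, [c + 0.5 for c in codes[i]])

    # Stage 1: rank everything using only the compressed codes.
    candidates = sorted(range(len(codes)), key=coarse_distance)[:n_candidates]
    # Stage 2: only these candidates need full-precision vectors from disk.
    return sorted(candidates,
                  key=lambda i: math.dist(query, full_vectors[i]))[:k]

random.seed(2)
full_vectors = [[random.random() for _ in range(8)]
                for _ in range(1000)]          # stand-in for vectors on disk
codes = [quantize(v) for v in full_vectors]    # compact codes kept in memory

results = hybrid_search(full_vectors[3], codes, full_vectors,
                        k=5, n_candidates=50)
print(results)
```

Only `n_candidates` full-precision vectors are read per query, so the slower storage tier is touched for a small fraction of the index.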

Insights into complex data

The landscape of data is ever-evolving. Complex data is growing rapidly and most organizations are ill-equipped to analyze it. The traditional databases that most companies already have in place are ill-suited to handle this type of data, and so there is a growing need for new ways to organize, store, and analyze unstructured data. Solving complex problems requires being able to search for and analyze complex data.

And the key to unlocking the insights of complex data is the vector database.

Dave Bergstein is director of product at Pinecone. Dave previously held senior product roles at Tesseract Health and MathWorks where he was deeply involved with productionalizing AI. Dave holds a PhD in electrical engineering from Boston University studying photonics. When not helping customers solve their AI challenges, Dave enjoys walking his dog Zeus and doing CrossFit.

—

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.