The emergence of generative AI has recently sparked discussion and created both excitement and concern (see an open letter) among technologists. ChatGPT, Midjourney, DreamFusion, and the eagerly anticipated GPT-4 have already made a name for themselves by performing at a “human level”, capturing significant attention.
The swift evolution of various AI models and the ready availability of APIs from OpenAI and similar platforms have attracted early adopters to develop AI-based applications that can quickly navigate vast, often unstructured datasets. However, even the most advanced AI models cannot be operationalised and made accessible to users without the right form of database and search - one that can adequately extract the most valuable insights encapsulated in, for example, long sentences. Unlike our current conventional methods, which are limited to retrieving information based solely on isolated keywords, this new, almost intelligent search will soon offer a way to discover previously unattainable information!
The potential of this technology is immense, with far-reaching implications for various sectors. From finance and healthcare to education and entertainment, “intelligent search”, aka similarity search (a $1T opportunity), can revolutionise how we interact with information and make decisions!
How do we get that “intelligent search”?
The proliferation of data sources has created a significant challenge, particularly with unstructured data such as documents, images, videos, and plain text found on the web, which has historically been too difficult to store. Such data is simply too complex for most traditional structured databases: because they rely on keywords and metadata as classifiers, they fail to capture all of its characteristics. However, thanks to machine learning advancements, we can better represent complex data by transforming it into vector embeddings, a format that can be analysed more easily.
So, what are these vector embeddings?
Although a vector is simply an array of numbers, it has the remarkable ability to describe complex objects such as sentences, images, or audio files as points in a continuous, high-dimensional space known as an embedding space. Objects with similar meaning or content end up close to one another in that space.
This approach provides a more precise and comprehensive technique for storing and analysing unstructured data, making it one of the most useful concepts in the field of machine learning. This is especially true in domains such as recommendation systems, search engines, and text generation tools like ChatGPT, where vector embeddings play a critical role.
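To make the idea of "closeness" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors and their names are invented for illustration; real embeddings usually have hundreds or thousands of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example (not model output).
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.95]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_invoice = cosine_similarity(cat, invoice)

# Related concepts score higher than unrelated ones.
print(sim_cat_kitten > sim_cat_invoice)
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they have little in common.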
How do we obtain these vector embeddings?
Vector embeddings are a byproduct of AI models, specifically deep learning models that are trained on large sets of input data. These embeddings correspond to specific content and are later used for conducting semantic similarity searches. To manage and search these embeddings effectively, engineers utilise vector databases.
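As a rough illustration of the "content in, vector out" step, the sketch below uses a toy word-count function as a stand-in for a deep learning model. The vocabulary and the `toy_embed` name are assumptions made up for this example, and unlike a learned embedding this toy version captures no meaning beyond which words appear. It only shows that every piece of text maps to a fixed-length array of numbers.

```python
from collections import Counter

# Hypothetical fixed vocabulary; a real model learns its features from data.
VOCABULARY = ["database", "vector", "search", "image", "music"]

def toy_embed(text):
    """Map text to a fixed-length vector of word counts over VOCABULARY."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCABULARY]

doc_vector = toy_embed("Vector search in a vector database")
print(doc_vector)
```

Whatever the length of the input text, the output always has one number per vocabulary entry, which is what lets vectors from different documents be compared with each other.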
Unlike traditional relational databases with rows and columns or document databases with documents and collections, vector databases index numerical arrays so that similar vectors are grouped together. Because a query then typically only has to be compared against nearby vectors rather than every record, lookups stay fast even on large datasets!
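One way to picture this grouping is the sketch below: a toy index that assigns each vector to its nearest centroid and, at query time, searches only the matching bucket. This is an assumption-laden simplification in the spirit of inverted-file indexes, not a description of how any particular vector database works. The class name, the fixed centroids, and the two-dimensional data are all hypothetical, and because only one bucket is searched the results are approximate.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyClusteredIndex:
    """Groups vectors by nearest centroid; queries scan a single group."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, vector):
        return min(range(len(self.centroids)),
                   key=lambda i: distance(vector, self.centroids[i]))

    def add(self, item_id, vector):
        self.buckets[self._nearest_centroid(vector)].append((item_id, vector))

    def query(self, vector, k=1):
        # Only the bucket closest to the query is scanned, not every record.
        bucket = self.buckets[self._nearest_centroid(vector)]
        ranked = sorted(bucket, key=lambda entry: distance(vector, entry[1]))
        return [item_id for item_id, _ in ranked[:k]]

# Hypothetical centroids; real systems usually learn them from the data.
index = ToyClusteredIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [0.5, 0.2])
index.add("b", [1.0, 1.5])
index.add("c", [9.0, 9.5])
index.add("d", [10.5, 9.0])

result = index.query([9.2, 9.4], k=2)
print(result)
```

The trade-off is that a true nearest neighbour sitting just across a bucket boundary can be missed, which is why this style of search is usually called approximate.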
Vector databases are becoming increasingly important in various applications that rely on natural language processing (NLP) or large volumes of text data. From semantic search to question answering, these databases offer powerful solutions that help companies unlock insights and process data more accurately and efficiently. So far, we have seen the emergence of several companies, with the latest addition being Chroma from San Francisco. Founded by Jeff Huber & Anton Troynikov, the company offers an innovative open-source tool (based on ClickHouse) that is certainly making some waves in the industry!
SQL database vs vector database (quick summary)
As mentioned briefly above, traditional SQL databases are useful for storing information about individual items. However, they often fall short when it comes to identifying similar items based on user preferences such as colour, size, or material. This is because SQL databases match on exact values and have no built-in notion of similarity, making it difficult to provide accurate recommendations to users. In this context, the emergence of vector databases has revolutionised the way we store and analyse data, enabling businesses to offer more personalised recommendations and improve the overall customer experience.
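The contrast can be sketched in a few lines. The example below, which uses an invented three-item catalogue with made-up feature vectors, runs an exact-match SQL query with Python's built-in `sqlite3` module and then ranks the same items by cosine similarity. It is an illustration under those assumptions, not a benchmark of either approach.

```python
import math
import sqlite3

# Hypothetical catalogue: (name, colour, feature vector). Vectors are invented.
products = [
    ("crimson wool scarf", "crimson", [0.9, 0.1, 0.8]),
    ("red wool scarf", "red", [1.0, 0.0, 0.8]),
    ("blue cotton shirt", "blue", [0.0, 1.0, 0.1]),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, colour TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(name, colour) for name, colour, _ in products])

# Exact match: only rows whose colour is literally 'red' come back.
exact = [row[0] for row in conn.execute(
    "SELECT name FROM products WHERE colour = ?", ("red",))]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Similarity search: rank every item by closeness to the viewed item's vector.
query_vector = [1.0, 0.0, 0.8]
ranked = sorted(products,
                key=lambda p: cosine_similarity(query_vector, p[2]),
                reverse=True)
similar_names = [name for name, _, _ in ranked[:2]]

print(exact)
print(similar_names)
```

The SQL query misses the crimson scarf because "crimson" is not the string "red", while the similarity ranking surfaces it as the second-closest item.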
In order to stay ahead of the competition in today’s business landscape, it has become imperative to make data-driven decisions! However, with the explosive growth of data, particularly unstructured data (as per Bloomberg’s report, the global artificial intelligence market, which is fuelled by data, is projected to expand at a compound annual growth rate of 39.4% and reach $422.37 billion by 2028), it is becoming increasingly clear that relying on conventional solutions may no longer be viable. While models like GPT-4 are currently deployed mainly by early adopters, it is only a matter of time before all modern companies start harnessing the power of their exponentially increasing volumes of unstructured data.
Though vector databases are still an emerging tool, adopting them can already help businesses improve the accuracy of their recommendations, enhance the user experience, and gain a competitive edge in their respective industries. Overall, it is clear that vector databases are a promising technology that will continue to shape the future of data storage and analysis, including search!
We (Inreach Ventures) are deeply committed to partnering with entrepreneurs who are reinventing the flow of data on a broad scale! We are extremely passionate about the space and plan to share even more of our thoughts on this rapidly evolving field in upcoming articles. If you are currently involved in developing a data infrastructure startup or working for a company driving this transformation, do not hesitate to reach out.