Tuesday, December 19, 2023

Leveraging LlamaIndex, Ollama, and Weaviate for RAG Applications in Controlled Environments

Tired of OpenAI's limitations for private data and eager to experiment with RAG on my own terms, I dove headfirst into a holiday quest: building a local, OpenAI-free RAG application. While countless tutorials guide Full Stack development, the "AI" magic often relies on OpenAI APIs, leaving private data concerns unresolved. So, fueled by frustration and holiday spirit, I embarked on a journey to forge my own path, crafting a RAG that would sing offline, on my own machine.

This post shares the hard-won wisdom from my quest, hoping to guide fellow explorers building RAGs in their own kingdoms. Buckle up, and let's delve into the challenges and triumphs of this offline adventure!

Retrieval-Augmented Generation (RAG) in Controlled Environments

There are several advantages to running a Large Language Model (LLM), Vector Store, and Index within your own data center or controlled cloud environment, compared to relying on external services:

  1. Data control: You maintain complete control over your sensitive data, eliminating the risk of unauthorized access or leaks in third-party environments.
  2. Compliance: Easily meet compliance requirements for data privacy and security regulations specific to your industry or region.
  3. Customization: You can fine-tune the LLM and index to be more secure and privacy-preserving for your specific needs.
  4. Integration: Easier integration with your existing infrastructure and systems.
  5. Potential cost savings: Although initial setup might be higher, running your own infrastructure can be more cost-effective in the long run, especially for high-volume usage.
  6. Predictable costs: You have more control over budgeting and avoid unpredictable scaling costs of external services.
  7. Independence: Reduced reliance on external vendors and potential risks of vendor lock-in.
  8. Innovation: Facilitates research and development of LLMs and applications tailored to your specific needs.
  9. Transparency: You have full visibility into the operation and performance of your LLM and data infrastructure.


Traditionally, training a base model is the most expensive stage of AI development. This expense is eliminated by using a pre-trained large language model (LLM), as proposed in this post. Owning and running this setup will incur costs comparable to any other IT application within your organization. To illustrate, the sample application below runs on a late-2020 MacBook Air with an M1 chip and generates responses to queries within 30 seconds.


Let's look at a RAG application and its data integration points before we identify potential points of sensitive data leakage. 

Source: https://docs.llamaindex.ai/en/stable/getting_started/concepts.html


When using a RAG pipeline with an external API like OpenAI, there are several points where your sensitive data could potentially be compromised. Here are some of the key areas to consider:


Data submitted to the API:

  • Query and context: The query itself and any additional context provided to the API could contain personally identifiable information (PII) or other sensitive data.
  • Retrieved documents: If the RAG pipeline retrieves documents from a corporate knowledge base, those documents might contain PII or sensitive information that gets incorporated into the index and transmitted to the external LLM API to generate the answer.

Transmission and storage:

  • Communication channels: Data transmitted between your system and the external API might be vulnerable to interception if not properly secured with encryption protocols like HTTPS.
  • API logs and storage: The external API provider might store logs containing your queries, contexts, and retrieved documents, which could potentially be accessed by unauthorized individuals or leaked in security breaches.

Model access and outputs:

  • Model access control: If the external API offers access to the underlying LLM model, it's crucial to ensure proper access controls and logging to prevent unauthorized use that could potentially expose sensitive data.
  • Generated text: Be aware that the LLM might still include personal information or sensitive content in its generated responses, even if the query itself didn't explicitly contain it. This can happen due to biases in the LLM's training data or its imperfect understanding of context.


The quest for private, accurate and efficient search has led me down many winding paths, and recently, three intriguing technologies have emerged with the potential to revolutionize how we interact with information: LlamaIndex, Ollama, and Weaviate. But how do these tools work individually, and how can they be combined to build a powerful Retrieval-Augmented Generation (RAG) application? Let's dive into their unique strengths and weave them together for a compelling answer.


1. LlamaIndex: Indexing for Efficiency

LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. 

Imagine a librarian meticulously filing away knowledge in an easily accessible system. That's essentially what LlamaIndex does. It loads documents such as PDFs, emails, and code, splits them into chunks, and uses an embedding model to turn each chunk into a dense vector that can be searched by similarity. When it is paired with a locally hosted embedding model and LLM, the whole pipeline can run offline, so your data stays on your own machine. Feed LlamaIndex a corpus of scientific papers and it will build a vector index, ready for fast semantic search.

Thursday, December 14, 2023

Leveraging Retrieval-Augmented Generation (RAG) for Enterprise AI

In today's data-driven landscape, enterprises are increasingly seeking to leverage the power of artificial intelligence (AI) to unlock new insights and automate tasks. However, commercial SaaS AI models often struggle to handle the specific data (usually hidden behind firewalls) and nuances of large organizations. This is where retrieval-augmented generation (RAG) comes in.

RAG is a powerful technique that augments the knowledge of large language models (LLMs) with additional data, enabling them to reason about private information and data that was not available during training. This makes RAG particularly valuable for enterprise applications, where sensitive data and evolving business needs are the norm.
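To make the retrieve-then-augment idea concrete, here is a minimal sketch using only the Python standard library. It is not how any particular framework implements RAG: the bag-of-words "embedding", the sample documents, and the function names are all assumptions made for illustration, and a real pipeline would use a proper embedding model and send the final prompt to an LLM.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, documents, top_k=2):
    """Augment the user's question with the retrieved private context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical private documents the base model never saw during training.
DOCS = [
    "The expense policy caps hotel stays at 200 dollars per night.",
    "The VPN must be used when accessing internal systems remotely.",
    "Annual leave requests need manager approval two weeks ahead.",
]

if __name__ == "__main__":
    print(build_prompt("What is the hotel expense cap?", DOCS, top_k=1))
```

The model itself is never retrained here; the private knowledge reaches it only through the prompt, which is why new documents can be added at any time.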


In one of my recent project proposals, I advocated for the implementation of RAG pipelines across various business units within a large enterprise client. These types of initiatives have the potential to revolutionize the way enterprises utilize AI, enabling them to:

  • Unlock insights from private data: RAG can access and process confidential data, allowing us to glean valuable insights that were previously out of reach.
  • Improve model accuracy and relevance: By incorporating domain-specific data into the RAG pipeline, we can ensure that the generated outputs are more accurate and relevant to the specific needs of each business unit.
  • Boost model efficiency: RAG can help to reduce the need for extensive data retraining, as the model can leverage its existing knowledge and adapt to new information on the fly.
  • Future-proof AI applications: By continuously incorporating new data into the RAG pipeline, we can ensure that our AI models remain up-to-date and relevant in the ever-changing business landscape.

Tuesday, December 12, 2023

Mixtral: A High-Quality Sparse Mixture-of-Experts Model with Open Weights by Mistral AI

Mistral AI just announced Mixtral, a new open-source AI model - https://mistral.ai/news/mixtral-of-experts. Mistral AI holds a special place for me due to its Apache 2 open-source license. It truly embodies the spirit of "Open" AI.


What is Mixtral?

Mixtral is a new open-source (Apache 2) AI model that is based on a sparse mixture-of-experts (MoE) architecture. An MoE model contains multiple expert sub-networks plus a router network that decides which experts handle each input. According to Mistral's announcement, at every layer the router picks two experts for each token and combines their outputs, so only a fraction of the model's total parameters is used for any given token.
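The routing idea can be sketched in a few lines of plain Python. This is a simplified scalar illustration, not Mixtral's actual implementation: the expert functions and router scores below are made up, and the top-2 selection with a softmax over the selected scores is an assumption based on how Mistral's announcement describes the router.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(x, router_logits, experts):
    """Run only the two highest-scoring experts and mix their outputs.

    Returns (output, chosen_expert_indices, mixing_weights).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:2]
    # Normalise the scores of the chosen experts so the weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen, weights

# Hypothetical experts: in a real model each would be a feed-forward network.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]

# Hypothetical router scores for one token; a real router computes these
# from the token's hidden state.
ROUTER_LOGITS = [0.1, 2.0, -1.0, 1.0]

if __name__ == "__main__":
    out, chosen, weights = top2_moe(3.0, ROUTER_LOGITS, EXPERTS)
    print("experts used:", chosen)
    print("weights:", [round(w, 3) for w in weights])
    print("output:", round(out, 3))
```

Experts 0 and 2 are never evaluated for this input, which is the source of the compute savings: the model can hold many experts while paying the cost of only two per token.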

Monday, December 11, 2023

The Future is Serverless: The Path to High-Flying DORA Metrics (Across Cloud Providers)

Having implemented a large-scale serverless integration middleware platform on AWS for a client, I thought of writing this blog post to note how serverless best practices can help organizations improve their DevOps Research and Assessment (DORA) metrics.

DORA metrics are a set of four key metrics that measure the performance of software delivery teams:

  • Deployment frequency: How often does the team release new features to production?
  • Lead time for changes: How long does it take for a code change to be deployed to production?
  • Mean time to restore (MTTR): How long does it take to recover from a production incident?
  • Change failure rate: What percentage of deployments cause a production incident?
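As a rough illustration of how these four numbers could be derived from deployment records, here is a small Python sketch. The `Deployment` record, its field names, and the sample data are all hypothetical; real teams usually pull this information from their CI/CD and incident-tracking tools, and definitions vary (for example, median instead of mean lead time).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed

def mean_hours(deltas):
    """Average a list of timedeltas in hours; None when the list is empty."""
    if not deltas:
        return None
    total = sum(deltas, timedelta())
    return total.total_seconds() / 3600 / len(deltas)

def dora_metrics(deployments, period_days):
    """Compute the four DORA metrics over an observation period."""
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    return {
        "deployments_per_day": count / period_days,
        "mean_lead_time_hours": mean_hours(
            [d.deployed_at - d.committed_at for d in deployments]),
        "mean_time_to_restore_hours": mean_hours(
            [d.restored_at - d.deployed_at for d in failures if d.restored_at]),
        "change_failure_rate": len(failures) / count if count else None,
    }

# Hypothetical week of deployments, for illustration only.
START = datetime(2023, 12, 4, 9, 0)
SAMPLE = [
    Deployment(START, START + timedelta(hours=2)),
    Deployment(START + timedelta(days=1), START + timedelta(days=1, hours=4)),
    Deployment(START + timedelta(days=2), START + timedelta(days=2, hours=6),
               failed=True,
               restored_at=START + timedelta(days=2, hours=7, minutes=30)),
    Deployment(START + timedelta(days=3), START + timedelta(days=3, hours=8)),
]

if __name__ == "__main__":
    for name, value in dora_metrics(SAMPLE, period_days=7).items():
        print(f"{name}: {value}")
```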

Saturday, December 09, 2023

Hallucinations and Large Language Model Assistants: A Look at the Problem and Potential Solutions

Source: How LSD Can Make Us Lose Our Sense of Self - Neuroscience News

Generative AI, with its incredible ability to create text, code, images, and music, has become a powerful tool across various industries. However, a growing concern exists surrounding "hallucinations," where AI models generate inaccurate, misleading, or outright false outputs. This phenomenon poses significant risks, from spreading misinformation to undermining the credibility of AI-generated content.


What do the experts say?

The reason I wrote this post was to capture the essence of one of Andrej Karpathy's recent tweets.

The Big Three of Code AI: Duet AI, GitHub Copilot, and AWS CodeWhisperer

The rise of artificial intelligence (AI) has transformed many industries, and the field of software development is no exception. AI-powered code assistants like Duet AI from Google, GitHub Copilot from Microsoft, and CodeWhisperer from Amazon Web Services (AWS) are changing the way developers work by providing intelligent suggestions, automating repetitive tasks, and helping them write better code faster.

While these three offerings share a common goal, they approach code assistance in slightly different ways, catering to specific needs and workflows. Let's delve into their individual strengths and weaknesses to help you choose the best AI companion for your coding journey.


Duet AI

Let's start with the new kid on the block, who doesn't seem to get mentioned a lot.


Google's entry into the code assistance arena aims to be a versatile companion for developers across various disciplines. It leverages the power of Google's foundation models, including Codey, to offer code recommendations, generate code snippets, and even translate between different programming languages. Duet AI also boasts strong integration with Google Cloud services, making it a natural choice for developers working within that ecosystem.

Friday, December 08, 2023

Playing God - The Ethical Implications of Artificial General Intelligence (AGI)

Two days ago, Google unveiled Gemini, a new AI model that is capable of understanding and generating text, code, audio, images, and video. Although I wouldn't categorise Gemini as Artificial General Intelligence (AGI), it is a significant step towards that eventuality.

A Technical Report is available at https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

There is no doubt that Gemini is a powerful tool. I might even argue that it is an OpenAI killer, and that 2024 looks like the year of Gemini, just as 2023 was dominated by OpenAI and ChatGPT. It has the potential to revolutionize the way we interact with technology. However, it is important to remember that AI is a tool, and like any tool, it can be used for good or evil. It is up to us to ensure that Gemini is used for the benefit of humanity.

Wednesday, December 06, 2023

Unlocking the Power of Large Language Models: A Journey with Andrej Karpathy into the Future of AI

Step into the captivating realm of Large Language Models (LLMs) with this must-watch video featuring the brilliant Andrej Karpathy.

In this video, Karpathy unfolds the intricacies of LLMs, offering a fascinating glimpse into their training methodologies, capabilities, and the thrilling promises they bring to the table. Imagine a world where artificial intelligence seamlessly generates text, translates languages, crafts creative content, and provides informative responses – LLMs make this a reality. Their ability to not only absorb vast amounts of data but also adapt to new information is truly mind-boggling.

Tuesday, December 05, 2023

The Frugal Architect

During his keynote at AWS re:Invent 2023, Dr. Werner Vogels discussed several crucial considerations for architects designing distributed systems in today's cloud-native era. These seven laws encompass cost optimization, resilience, profiling, application risk categorization, and observability—factors most of us inevitably take into account when crafting solutions for our customers. 

Notably, this was the first time I had seen these principles neatly presented on a website: thefrugalarchitect.com.