David Szigeti

RAGs to RIChes

When it comes to generative AI for code, RAG (Retrieval-Augmented Generation) is like the internet for computers. However, the common RAG approach of chunking files and indexing them isn't ideal for working with git projects. It degrades contextual awareness and creates a mismatch between code and natural language. We've developed a novel technique called RIC: Retrieval Input Compression. RIC leverages the strength of LLMs to find the correct files to answer your prompts. Here's how it works (sketched in code after the list):

1. We index the git history and files statically.
2. We compress the information at the retrieval stage, retaining semantic meaning.
3. This gives the LLM context-fitting data to infer the exact appropriate files.
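
Here is a minimal sketch of what that index-compress-ask flow can look like in practice. The function names and the crude truncation-based "compression" are assumptions made up for this post, not machtiani's actual implementation; only the git commands are real.

```python
# Hypothetical sketch of the RIC flow, not machtiani's real code:
# 1) statically index file paths and git history,
# 2) compress the index so it fits the model's context window,
# 3) ask the LLM to name the exact files relevant to the prompt.
import subprocess


def index_repo(repo: str) -> dict:
    """Index tracked file paths and recent commit messages."""
    files = subprocess.run(
        ["git", "-C", repo, "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    history = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", "-n", "200"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {"files": files, "history": history}


def compress(index: dict, max_chars: int = 8000) -> str:
    """Toy compression: keep every file path, truncate history to fit the budget."""
    listing = "\n".join(index["files"])
    budget = max(max_chars - len(listing), 0)
    return listing + "\n\n" + index["history"][:budget]


def retrieval_prompt(question: str, compressed_index: str) -> str:
    """Ask the LLM which files are needed to answer the question."""
    return (
        "Given this repository index, list the files needed to answer the "
        f"question.\n\nIndex:\n{compressed_index}\n\nQuestion: {question}"
    )
```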

The results are astounding.

Chatting with OpenStack's Keystone project, which has nearly 1,500 files.

By compressing the index at retrieval time rather than chunking files, RIC solves the apples-to-oranges mismatch between code and natural language. We can chat with git projects at scale.

Here is a video where I chat with Algorand's indexer project, which ingests Algorand transactions to make them easy to discover. It has nearly 400 files checked into git, and it works unreasonably well.

The other key advantage over traditional RAG is that it indexes across many more dimensions because it is tightly integrated with git. It understands the code not only through the commit history but also through the static state of the files checked into git. As a result, it can find the right file(s) like a needle in a haystack.
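
As a loose illustration of what "many dimensions" per file might mean, here is a sketch that pairs a file's static state with its git metadata. The field names are assumptions for the example, not machtiani's schema.

```python
# Hypothetical per-file 'dimensions': static content plus commit history,
# so retrieval can rank files on more signals than chunked text alone.
import subprocess
from pathlib import Path


def file_dimensions(repo: str, path: str) -> dict:
    """Combine a file's current contents with its git change history."""
    commits = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", "--follow", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    text = (Path(repo) / path).read_text(errors="replace")
    return {
        "path": path,                   # static: where it lives
        "preview": text[:500],          # static: what it contains
        "change_count": len(commits),   # history: how often it changes
        "recent_commits": commits[:5],  # history: why it changed recently
    }
```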

Unix philosophy

The final point is more of a personal qualm. I hate the locked-in nature of current chat services. Your chats are locked away unless you want to bother copying, pasting, and parsing the conversations. Also, these web apps are not very unixy. I want to be able to prompt right in the git directory where I'm working. Why not pipe the output of machtiani into some other tool? Why not version control the chats, so you can go back a few exchanges and take the discussion down a different branch without losing what you already did? So I built a CLI for it that outputs cleanly and saves the conversation in a chat folder. It names the file in a human-understandable way, like current chat services do with your chats. Now switching between models, from closed API services to free and open LLMs, is a reality.
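
As a rough illustration of the "version control your chats" idea (this is a layout I'm assuming for the example, not the machtiani CLI's actual on-disk format), each exchange can be appended to a plainly named file inside the repo, and ordinary git branching then covers going back and taking a discussion in a different direction.

```python
# Hypothetical chat layout: one readable markdown file per conversation,
# appended to on every exchange, so git can diff, branch, and revert chats
# just like code.
import re
from datetime import datetime
from pathlib import Path


def save_exchange(chat_dir: str, title: str, prompt: str, answer: str) -> Path:
    """Append a prompt/answer pair to a human-readably named file."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    path = Path(chat_dir) / f"{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().isoformat(timespec="seconds")
    with path.open("a") as f:
        f.write(f"\n## {stamp}\n\nPrompt: {prompt}\n\n{answer}\n")
    return path
```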

Hallucinations

It's garbage in, garbage out; that's fundamentally why LLMs hallucinate. Without high-quality context to guide them, they hallucinate a lot. Nothing mitigates hallucinations better than answering from real files and data in context. In the rare hallucination, never in my experience has it talked about a non-existent file or code that doesn't really exist; the errors are more in the line of mistakes, like reimplementing a struct it had already seen defined. But it doesn't just make things up.

Current limitations

The main limitation we've experienced is that latency grows with the size of the project. A few steps require checking files and diffs in git. These can be made more concurrent, batch-processed, or simply sped up by brute force with a compiled language. Currently, the service is written in Python without leveraging C/C++. We aim to improve on those fronts, which should make arbitrarily large projects only a few seconds slower than the inference itself.
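
To show the kind of optimization meant here, this sketch runs the per-file git checks through a thread pool instead of one after another. The specific check and the pool size are assumptions for the example.

```python
# Illustrative only: batching per-file git round-trips with a thread pool so
# latency stops scaling linearly with the number of files.
import subprocess
from concurrent.futures import ThreadPoolExecutor


def diff_stat(repo: str, path: str) -> str:
    """One git round-trip that would otherwise run serially."""
    return subprocess.run(
        ["git", "-C", repo, "diff", "--stat", "HEAD~1", "HEAD", "--", path],
        capture_output=True, text=True,
    ).stdout


def diff_stats(repo: str, paths: list[str]) -> dict[str, str]:
    """Run the checks concurrently and collect the results per path."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = pool.map(lambda p: diff_stat(repo, p), paths)
        return dict(zip(paths, results))
```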

We coupled it to OpenAI simply to get a useful service out the door. It's a high priority to make it work with any model just by modifying your configs, without having to wait for the project to support it.
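
For what that config-only switch can look like today, here is the common pattern of pointing an OpenAI-compatible client at a different backend. The local endpoint and model name below are just examples (an Ollama server, in this case), not machtiani's configuration format.

```python
# Swapping backends by changing only the endpoint and model name: any server
# that speaks the OpenAI-compatible API (Ollama, vLLM, etc.) can stand in.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama endpoint, for example
    api_key="not-needed",                  # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="llama3",  # whatever model the local server is serving
    messages=[{"role": "user", "content": "Which files handle token auth?"}],
)
print(response.choices[0].message.content)
```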

Open source dedication

First, let me say right away that it's not currently open source. But believe me, I have no interest in clutching this like Gollum in The Lord of the Rings. I'm acutely aware that machtiani is built completely on open source, not to mention LLMs trained on public data under fair use, at best. And worrying about scaling, security, and marketing takes time away from development, not to mention forgoing the network effect of a contributor community and the feedback loop that open source allows. There are no plans for dual licensing, and we are morally against it. It will be MIT licensed!

The plan is to raise some cash so I can afford to spend more time getting the codebase more modular and cleaned up for contributors. Please afford us that. Also, many people, myself included, would prefer to use a reasonably priced hosted service, but with the peace of mind that it is open source. We're a small company with only two full-timers working on it, and I'm the sole developer.

Plans

Service-first. I'm hoping that other services, like Khoj or other upstarts, will use the strategy, or just drop in machtiani's file retrieval (it can be used without the CLI), to take things to the next level. I'd like to see (neo)vim or Emacs plugins. I'm sort of against these aspiring god-like products or services, like VS Code or GitHub, that roll everything in, gatekeep upstarts, and wall off competition.

Tests. Adding more features without tests will just lead to regressions. So I'm hesitant to add features, lest it turn into bloat and technical debt.

Optimize. File operations are synchronous where they shouldn't be. We need to leverage batch processing and be smarter about doing things in memory to avoid long round trips. The data structures, too: the ones in use were chosen for developer ease, not speed. Finally, the whole thing should be rewritten in Go, maybe Rust, but most likely Go.

Web. I'd like to add a --web flag so that it can include context found on the web.

Open source. Staying closed source is a nightmare scenario for me. I'd rather be using machtiani to build things efficiently and quickly than spending my life on it. Hosting at scale is a whole other class of issues unrelated to the core project. There is no reason users shouldn't be able to run this as a local service on their desktop (or mobile device), unless they want it to work against a private LLM with tens of billions of parameters.

Execution feedback loop. Thanks to machtiani, I figured out how to customize SWE-agent, an open source service similar to Devin, to use machtiani's code retrieval prowess. SWE-agent can close nearly 13% of GitHub issues, and it relies on really terrible code search and discovery tools. The idea is that, by swapping those out, it should perform more efficiently and effectively.

Fine-tuned Open LLM. The new coding data that LLMs will be trained on will be mostly synthetic. As more and more developers use LLMs to generate code, there will be less and less entropy. Some are predicting it's like digital mad cow disease, where the LLM is fed its own byproduct. If we can train on execution results instead, whether the code compiled, ran, or passed its tests, that creates a supernova of entropy and high-quality data. And machtiani works natively with git, so if an execution feedback loop is implemented, it will be able to produce fine-grained training data at every commit of the code base. It also empowers individuals and small organizations: you only need a few thousand examples to fine-tune, and each commit can be an example. Hopefully, users can be incentivized to contribute to an open source LLM that I think will be more effective than a giant, mega-corpo zombie LLM fed on unreliable 'thumbs up' or 'thumbs down' chat histories.
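
Here is a rough sketch of what harvesting commit-level examples with an execution signal could look like. The test command, the example schema, and the assumption that the working tree is clean enough to check out past commits are all simplifications for illustration.

```python
# Hypothetical pipeline: for each recent commit, record the diff and whether
# the project's tests pass at that commit, yielding (change, outcome) pairs
# that could serve as fine-tuning examples.
import subprocess


def commit_examples(repo: str, test_cmd: list[str], limit: int = 50) -> list[dict]:
    """Walk recent commits, capture each diff, and run the test suite."""
    original = subprocess.run(
        ["git", "-C", repo, "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    shas = subprocess.run(
        ["git", "-C", repo, "rev-list", f"--max-count={limit}", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    examples = []
    try:
        for sha in shas:
            diff = subprocess.run(
                ["git", "-C", repo, "show", "--patch", sha],
                capture_output=True, text=True,
            ).stdout
            subprocess.run(["git", "-C", repo, "checkout", "--quiet", sha], check=True)
            passed = subprocess.run(test_cmd, cwd=repo).returncode == 0
            examples.append({"commit": sha, "diff": diff, "tests_passed": passed})
    finally:
        subprocess.run(["git", "-C", repo, "checkout", "--quiet", original])
    return examples
```

Something like commit_examples(".", ["pytest", "-q"]) would then yield a few dozen labeled examples from the current repo, one per commit.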

What does Machtiani mean?

It's a mashed-up and slightly shuffled name based on Latin and Nahuatl (the language of the Aztec Empire). It's intended to convey the idea of 'machine', marrying two worlds, or ways of doing things: generative AI and the git project, which preceded generative AI but will live on in a new way. Mostly, it just looks and sounds good, so don't take it too seriously.
