PostgreSQL

This page explains how to retrieve data from PostgreSQL for RAG.

Semantic search

Overview

Install PgVector extension to the PostgreSQL database.
Add a vector column to a new or existing table.
Data <=> Embedding synchronization
1. Load data which you want to search from.
2. Pass the data to an embedding model and get an vector.
3. Set the vector to the vector column.
Retrieval
1. Pass the user input to an embedding model and get an vector.
2. Get the nearest neighbors to the user input vector e.g. SELECT * FROM a_table ORDER BY vector_column <-> '<user-input-vector>' LIMIT 5;

Vector store with PgVector

To store the embeddings for semantic search, we need to add a vector store in GitLab PostgreSQL. This vector store can be added by installing PgVector extension (Postgres 12+ is required). A vector store is currently running on GitLab.com and it's separately hosted from the main/CI databases.

Our current architecture of having a separate database for embeddings is probably ideal. We don't gain much by combining them and, as PGVector is all new and will likely require a lot of experimenting to get performance at scale (today we only have a tiny amount of data in it), we'll have a lot more options to experiment with without impacting overall GitLab.com stability (if PGVector is on a separate database). Having a separate database is recommended because it allows for experimentation without impacting performance of the main database.

Limitations

It could be locked down to a specific embedding model, because you must specify the dimensions of the vector column.
Vectors with up to 2,000 dimensions can be indexed.

Performance and scalability implications

Is there any guidance on how much data we can add to the PostgreSQL (regardless of the vector data or normal data)?
- Not really, as we do not usually just add data to the database, but rather it's a result of the instance being used. I don't see any specific storage requirements. If the existing vertex_gitlab_docs table size is a good indicator, we probably can add this without causing much trouble, though having an option to opt-in or opt-out is preferable.

Availability

PostgreSQL is availble in all GitLab installations (both CNG and Omnibus).
Most major cloud providers have added PgVector to their offerings by now: Google Cloud SQL and Alloy DB, DigitalOcean, AWS RDS and Aurora, Azure Flexible and Cosmos, etc. There might be a case where customers would need to upgrade to versions that support PGVector.

ID search

Overview

Execute a few-shot prompts to extract a resource identifier from the user input.
- e.g. When user asks Can you summarize #12312312?, ResourceIdentifier is 12312312 as a GitLab-Issue.
Retrieve the record from the PostgreSQL. e.g. Issue.find(12312312)
Check if the user can read the resource.
Build a prompt with the retrieved data and passing it to an LLM to get a AI-generated response.

PoC: Repository X Ray

Repository X Ray hasn't yet implemented any semantic seach and this section is based soely on a prototype implementation

Statistics (as of February 2024):
- Date type: JSON document with source code libraries desctiptions in natural language
- Date access level: Red (each JSON document belongs to specific project, and data access rules should adhere to data access rules configure for that project)
- Data source: Repository X Ray report CI artifact
- Data size: N/A
- Example of user input: "# generate function that fetches sales report for vendor from App Store"
- Example of expected AI-generated response:
```
def sales_reports(vendor_id)\n  app_store_connect.sales_reports(\n  filter: {\n    report_type: 'SALES',\n    report_sub_type: 'SUMMARY',\n    frequency: 'DAILY',
  vendor_number: '123456'\n  }\n)\nend
```

Synchronizing embeddings with data source

In symilar manner as with the documentation example Repository X Ray report data is a derivative. It uses an underlaying repository source code as a base, and it must be synchronised with it, whenever any changes to the source code occurs.

Right now there is no synchronisation mechanism that includes embeddings and vector storage. However there is an existing pipeline that generates and stores Repository X Ray reports

The ingestion pipeline is performed in following steps:

A CI X Ray scanner job is triggered - a documentation page suggest limiting this job to be executed only when changes occur to the main repository branch. However repository maintainers may configure trigger rules differently.
1. An X Ray scanner locates and process one of the supported dependencies files, producing JSON report files
After the X Ray scanner job finishes successfully, a background job is triggered in GitLab Rails monolith that imports JSON report into Projects::XrayReport
1. There can be only one Repository X Ray report per project in the scope of programming language, duplicated records are being upserted during import process

As of today, there are 84 rows on xray_reports table on GitLab.com.

Retrieval

After Repository X Ray report gets imported, when IDE extension sends request for a code generation, Repository X Ray report is retrieved, in following steps

Fetch embedding of the user input from textembedding-gecko model (768 dimensions).

Query to vertex_gitlab_docs table for finding the nearest neighbors. For example:

SELECT *
FROM vertex_gitlab_docs
ORDER BY vertex_gitlab_docs.embedding <=> '[vectors of user input]'               -- nearest neighbors by cosine distance
LIMIT 10

GitLab Rails monotlith fetches corresponding xray_reports record from main database. xray_reports records are filiterd based on project_id foreign key, and lang columns.
From retrieved record first 50 dependencies are being added into a prompt that is forwarded to AI Gateway

Current state overview

sequenceDiagram
    actor USR as  User
    participant IDE
    participant GLR as GitLabRails
    participant RN as GitLabRunner
    participant PG as GitLabPsqlMainDB
    participant AIGW as AIGateway
    USR->>+GLR: commits changes to Gemfile.lock
    GLR->>RN: triggers Repository X Ray CI scanner job
    RN->>GLR: Repository X Ray report
    GLR->>GLR: triggers Repository X Ray ingestion job
    GLR->>-PG: upserts xray_reports record

    USR->>+IDE: types: "#35; generate function that fetches sales report for vendor from App Store"
    IDE->>+GLR: trigger code generation for line ` "#35; generate function `
    GLR->>PG: fetch X Ray report for project and language
    PG->>GLR:  xray_reports record
    GLR->>GLR: include first 50 entities from xray report into code generation prompt
    GLR->>-AIGW: trigger code generation ` "#35; generate function `

Embeddings prospect application

As described in retrieval section above, currently Repository X Ray reports follows very naive approach, that does not iclude any metric for assesing relevance between Repository X Ray report content and user instruction. Therefore applying embeddings and semantic search to X Ray report has a high potential of improving results by selecting limited set of related entries from Repository X Ray report based on user instruction.

To achieve that embeddings should be generated during Repository X Ray ingestion. Additionaly an user instruction should be turned into embeddings vector to perform semantic search over stored Repository X Ray report data during retrieval process.