RAG Hands-On with MongoDB and Gemma (without LangChain)

BangPro 2024. 5. 28. 18:49

Overview

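Model used : Gemma
Goal : In the context of the LangChain framework, use an open-source LLM, embed the documents ourselves, store the embeddings in a vector DB, and then build a RAG system. (As the title notes, the code in this post does not call LangChain itself.)
Tech stack
- LangChain
- MongoDB
- RAG
- Gemma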

STEP 1 : Install the libraries

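PyMongo: A Python library for interacting with MongoDB. It connects to a cluster and queries the data stored in collections and documents.
Pandas: Provides data structures for efficient data processing and analysis in Python.
Hugging Face datasets: Gives access to audio, vision, and text datasets.
Hugging Face Accelerate: Abstracts away the complexity of writing code that uses hardware accelerators such as GPUs. Here Accelerate is used to run the Gemma model on GPU resources.
Hugging Face Transformers: Provides access to a collection of pretrained models.
Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings.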

!pip install datasets pandas pymongo sentence_transformers -q
!pip install -U transformers -q
# Install the library below when using a GPU
!pip install accelerate -q
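
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the package list simply mirrors the pip commands above, and the helper name installed_versions is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as installed by the pip commands above.
PACKAGES = [
    "datasets",
    "pandas",
    "pymongo",
    "sentence-transformers",
    "transformers",
    "accelerate",
]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```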

STEP 2 : Prepare the data

The data for this tutorial comes from Hugging Face Datasets:
- https://huggingface.co/datasets/MongoDB/embedded_movies

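# Load the dataset
from datasets import load_dataset
import pandas as pd

# Dataset id taken from the URL above (the original notebook used 'AIatMongoDB/embedded_movies')
dataset = load_dataset('MongoDB/embedded_movies')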

dataset_df = pd.DataFrame(dataset['train'])
dataset_df.head()

First, use dropna to remove the rows whose "fullplot" field is empty.
Second, drop the "plot_embedding" column. New embeddings are generated with the gte-large model and stored instead.

# Remove rows with missing data
dataset_df = dataset_df.dropna(subset=['fullplot'])
print('\nNumber of missing values in each column after removal')
print(dataset_df.isnull().sum())

# Drop plot_embedding
dataset_df = dataset_df.drop(columns = ["plot_embedding"])
dataset_df.head(5)
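
To see what the two cleaning steps do in isolation, here is a small sketch on a toy DataFrame. The rows are made up for illustration and are not taken from the movie dataset; only the column names match.

```python
import pandas as pd

# Toy stand-in for the movie dataset (values are made up for illustration).
toy_df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "fullplot": ["Some plot", None, "Another plot"],
        "plot_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    }
)

# Keep only rows that have a fullplot, then drop the old embedding column.
cleaned = toy_df.dropna(subset=["fullplot"]).drop(columns=["plot_embedding"])

print(cleaned["title"].tolist())   # ['A', 'C']
print(list(cleaned.columns))       # ['title', 'fullplot']
```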

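STEP 3 : Generate the embeddings

Import SentenceTransformers to access the embedding model.
Load the gte-large embedding model with SentenceTransformers.
Define the get_embedding function.
- It takes a text string as input and returns its embedding.
- It first checks whether the input text is empty. If it is, it returns an empty list; otherwise it returns the embedding.
Pass each value of the "fullplot" column to get_embedding to create an embedding for every movie. The new embeddings are assigned to a new column, "embedding".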

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("thenlper/gte-large")

def get_embedding(text:str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()

dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()
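
Calling the model once per row through apply can be slow on a large DataFrame. As a possible alternative, the sketch below embeds texts in batches while keeping the same "blank text gives an empty list" rule as get_embedding. The function embed_texts and the stand-in fake_encode are hypothetical names introduced here so the example runs without downloading a model.

```python
from typing import Callable, Sequence

def embed_texts(
    texts: Sequence[str],
    encode: Callable[[list[str]], Sequence[Sequence[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed texts in batches; blank texts get an empty list, like get_embedding."""
    embeddings: list[list[float]] = [[] for _ in texts]
    # Remember the positions of non-blank texts so results go back in order.
    keep = [i for i, text in enumerate(texts) if text and text.strip()]
    for start in range(0, len(keep), batch_size):
        batch_idx = keep[start:start + batch_size]
        vectors = encode([texts[i] for i in batch_idx])
        for i, vector in zip(batch_idx, vectors):
            embeddings[i] = [float(x) for x in vector]
    return embeddings

# Hypothetical stand-in encoder: maps each text to [length, number of words].
def fake_encode(batch: list[str]) -> list[list[float]]:
    return [[float(len(t)), float(len(t.split()))] for t in batch]

print(embed_texts(["a b", "  ", "hello"], fake_encode, batch_size=2))
# [[3.0, 2.0], [], [5.0, 1.0]]
```

With the real model, embedding_model.encode could be passed as the encode argument, since SentenceTransformer.encode accepts a list of sentences, for example embed_texts(dataset_df["fullplot"].tolist(), embedding_model.encode).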

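STEP 4 : Set up and connect to the database

MongoDB serves as both a regular (operational) database and a vector database. It stores, queries, and searches vector embeddings efficiently, which keeps database management, maintenance, and cost simple.
To create a new MongoDB database, first set up a database cluster.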

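Sign up for a free MongoDB account via the following link. 👉 link
Select the 'Database' option to open the database deployment page, which lists the deployment details of your existing clusters. Click the 'build cluster' button to create a new database cluster.
Choose the configuration options that apply to your database, then click the 'Create Cluster' button to deploy the new cluster. MongoDB also lets you create a free cluster from the shared tab.

When building a proof of concept, add the Python host's IP to the whitelist (IP access list), or set the allowed IP to 0.0.0.0/0. Note that 0.0.0.0/0 accepts connections from any address, so use it only for experiments.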

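Once the cluster is created and deployed, it is accessible from the database deployment page.
Click the cluster's 'connect' button to see the options for connecting through drivers in various languages. Selecting a driver shows the connection URI.
This tutorial needs only the cluster URI. Store it in Google Colab under the name MONGO_URI; the code in STEP 6 reads it with userdata.get.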
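
Outside Colab, google.colab is not importable, so the same notebook would fail when reading the URI. A minimal sketch of a fallback, assuming the URI may also be exported as an ordinary environment variable; get_secret is a hypothetical helper name, not part of any library.

```python
import os

def get_secret(name: str):
    """Return a secret from Colab's userdata if available, else from os.environ."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Assumption: treat a missing or inaccessible Colab secret as "not set"
        # and fall back to the process environment.
        return os.environ.get(name)

mongo_uri = get_secret("MONGO_URI")
if not mongo_uri:
    print("MONGO_URI is not set")
```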

STEP 4.1 : Set up the database and collection

Check the prerequisites first:

A database cluster set up on MongoDB Atlas
The cluster URI
After creating the cluster, create a database and a collection inside the MongoDB Atlas cluster.
In the left menu, click Database > browse collection > create. Use Database : movies and collection : movie_collection_2.

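STEP 5 : Create the vector search index

In this step a vector index is created in MongoDB Atlas.
It is required for accurate and efficient vector-based search.

Open the database collection and click search index.
Choose the JSON Editor under Atlas Vector Search.
Create the index with the JSON below, naming it vector_index (the name the search code in STEP 7 refers to).
Wait until the status changes from Not Ready to Active.

With a vector search index, Atlas can traverse the documents efficiently and return those whose embeddings are most similar to the query embedding under the chosen vector similarity.
More details on vector search indexes: here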
```
{
"fields": [{
   "numDimensions": 1024,
   "path": "embedding",
   "similarity": "cosine",
   "type": "vector"
 }]
}
```
The numDimensions value of 1024 matches the dimensionality of the embeddings produced by the gte-large model. With the gte-base or gte-small model, numDimensions would be 768 or 384, respectively.
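
The "similarity": "cosine" setting compares vectors by the angle between them, which is also why the query vector and the stored vectors must have the same number of dimensions. A minimal pure-Python sketch of the measure itself (this illustrates the formula, not how Atlas computes or scales its score internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```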

STEP 6 : Create the database connection

The code below creates a MongoDB client object with PyMongo. Through it we connect to the cluster and access the database and collection.

import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
    """Establish a connection to MongoDB."""
    try:
        client = pymongo.MongoClient(mongo_uri)
        # MongoClient connects lazily, so ping the server to verify the connection now
        client.admin.command("ping")
        print("Connection to MongoDB successful")

        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed {e}")
        return None

mongo_uri = userdata.get("MONGO_URI")
if not mongo_uri:
    print("MONGO URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

#ingest data into Mongo DB
db = mongo_client["movies"]
collection = db["movie_collection_2"]

#Delete any existing records in the collection
collection.delete_many({})

documents = dataset_df.to_dict("records")
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")
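
Because get_embedding returns an empty list for blank text, a document could be inserted with an embedding that does not have the 1024 dimensions the index expects. One possible safeguard is to filter the records before insert_many. The sketch below is an assumption-laden illustration: split_by_embedding_dim is a hypothetical helper, and the sample records use tiny made-up vectors.

```python
EXPECTED_DIM = 1024  # matches numDimensions in the index definition

def split_by_embedding_dim(documents, dim=EXPECTED_DIM, field="embedding"):
    """Separate documents whose embedding has the expected length from the rest."""
    valid, invalid = [], []
    for doc in documents:
        vector = doc.get(field)
        if isinstance(vector, list) and len(vector) == dim:
            valid.append(doc)
        else:
            invalid.append(doc)
    return valid, invalid

# Made-up records with tiny vectors, checked against dim=3 for illustration.
sample = [
    {"title": "ok", "embedding": [0.1, 0.2, 0.3]},
    {"title": "empty", "embedding": []},
    {"title": "missing"},
]
valid, invalid = split_by_embedding_dim(sample, dim=3)
print([d["title"] for d in valid], [d["title"] for d in invalid])
# ['ok'] ['empty', 'missing']
```

In the tutorial's flow this would sit between dataset_df.to_dict("records") and collection.insert_many, inserting only the valid list.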

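STEP 7 : Run a vector search for the user's question

The next step implements a function that generates the query embedding, defines a MongoDB aggregation pipeline, and returns the vector search results.

The pipeline consists of a $vectorSearch stage and a $project stage. It runs the query with the generated vector, attaches the search score to each result, and shapes the output so that it contains only the needed fields such as plot, title, and genres.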

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for blank input, so test for emptiness
    if not query_embedding:
        print("Invalid query or embedding generation failed")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
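
For intuition about what the pipeline returns, the sketch below does the same job by brute force in plain Python: score every document against the query vector, sort, keep the top few, and leave the embedding out of the result. This is only a conceptual stand-in. Atlas Vector Search uses approximate nearest-neighbor search over numCandidates candidates rather than scanning everything, and the documents and vectors here are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(query_vector, documents, limit=4):
    """Score every document and return the top `limit`, without the embedding field."""
    scored = []
    for doc in documents:
        # Keep everything except the embedding, similar in spirit to $project.
        result = {k: v for k, v in doc.items() if k != "embedding"}
        result["score"] = cosine(query_vector, doc["embedding"])
        scored.append(result)
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:limit]

# Made-up two-dimensional "embeddings" for illustration.
docs = [
    {"title": "Romance", "embedding": [1.0, 0.0]},
    {"title": "Action", "embedding": [0.0, 1.0]},
    {"title": "RomCom", "embedding": [0.8, 0.6]},
]
hits = brute_force_search([1.0, 0.0], docs, limit=2)
for hit in hits:
    print(hit["title"], round(hit["score"], 2))
```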

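STEP 8 : Process the user query and load Gemma

To use Google's Gemma you need to request access on the Hugging Face site. 👉 link
Once access is granted, store your Hugging Face access token in Colab (read below as HUGGINGFACE_API_KEY) and pass it as a parameter when loading the model.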

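def get_search_result(query, collection):
    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot','N/A')}\n"

        # NOTE: this return sits inside the loop, so only the top hit reaches the prompt
        # (the output below reflects that). Dedent it one level to include all hits.
        return search_result

# Conduct query with retrieval of sources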
query = "What is the best romantic movie to watch and why?"
source_information = get_search_result(query,collection)
combined_information = f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."

print(combined_information)
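
In get_search_result above, the return statement is inside the for loop, so only the first hit ends up in the prompt. If all retrieved movies should be passed to the model, a variant along the following lines could be used. It is a sketch, not the original notebook's code: build_prompt is a hypothetical name and the results are hand-written stand-ins for vector_search output.

```python
def build_prompt(query, results):
    """Join every search hit into the context, then wrap it in the prompt format used above."""
    lines = [
        f"Title: {r.get('title', 'N/A')}, Plot: {r.get('fullplot', 'N/A')}"
        for r in results
    ]
    source_information = ("\n".join(lines) + "\n") if lines else ""
    return (
        f"Query: {query}\n"
        "Continue to answer the query by using the Search Results:\n"
        f"{source_information}."
    )

# Hand-written stand-ins for vector_search output (not real dataset rows).
fake_results = [
    {"title": "Movie A", "fullplot": "Two strangers meet on a train."},
    {"title": "Movie B"},
]
prompt = build_prompt("What should I watch?", fake_results)
print(prompt)
```

Keep in mind that more context means a longer prompt, which matters for a small model such as gemma-2b-it.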

import os
from google.colab import userdata
HUGGINGFACE_API_KEY = userdata.get('HUGGINGFACE_API_KEY')

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", token = HUGGINGFACE_API_KEY)
# On CPU, uncomment the line below and use it instead
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", token = HUGGINGFACE_API_KEY)
# On GPU, use the line below
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it",token = HUGGINGFACE_API_KEY, device_map = "auto")

Result

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?

# Move the input tensors to the GPU
input_ids= tokenizer(combined_information, return_tensors="pt").to("cuda")
response = model.generate(**input_ids, max_new_tokens=500)
print(tokenizer.decode(response[0]))

Result

<bos>Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
.

The best romantic movie to watch is **Shut Up and Kiss Me!** because it perfectly captures the essence of romantic comedies. The movie perfectly balances humor, heart, and romance, making it a delightful and unforgettable viewing experience.<eos>
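
As the output shows, the decoded text echoes the whole prompt and includes the <bos> and <eos> markers. If only the answer is wanted, one option is to strip those afterwards. The helper below is a hypothetical sketch working on plain strings; the prompt and decoded text are shortened made-up examples.

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Strip special tokens and the echoed prompt from a decoded generation."""
    text = decoded
    for token in ("<bos>", "<eos>"):
        text = text.replace(token, "")
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()

example_prompt = (
    "Query: Best movie?\n"
    "Continue to answer the query by using the Search Results:\n"
    "Title: X, Plot: Y\n."
)
example_decoded = "<bos>" + example_prompt + "\n\nThe best movie is **X**.<eos>"
print(extract_answer(example_decoded, example_prompt))
# The best movie is **X**.
```

An alternative on the model side is to decode only the newly generated token ids (slicing off the input length) and pass skip_special_tokens=True to tokenizer.decode.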

rag_with_hugging_face_gemma_mongodb.ipynb
