Asynchronous LLM API Calls in Python: A Comprehensive Guide

Aayush Mittal

10 months ago

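As developers and data scientists, we often need to interact with large language models (LLMs) through APIs. As applications grow in complexity and scale, efficient, high-performance API interaction becomes critical. This is where asynchronous programming shines: it lets us maximize throughput and minimize the time spent waiting on responses when working with LLM APIs.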

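In this comprehensive guide, we explore asynchronous LLM API calls in Python, covering everything from the basics of asynchronous programming to advanced techniques for handling complex workflows. By the end of this article, you will have a solid understanding of how to use asynchronous programming to speed up your LLM-powered applications.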

Before diving into the specifics of asynchronous LLM API calls, let's build a solid foundation in asynchronous programming concepts.

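Asynchronous programming lets multiple operations make progress concurrently without blocking the main thread of execution. In Python, this is achieved primarily through the asyncio module, which provides a framework for writing concurrent code using coroutines, event loops, and futures.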

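Key concepts:

Coroutines: functions defined with async def that can be paused and resumed.
Event loop: the central execution mechanism that manages and runs asynchronous tasks.
Awaitables: objects that can be used with the await keyword (coroutines, tasks, futures).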
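
To make the three kinds of awaitables concrete, here is a minimal sketch that awaits a coroutine, a task, and a future in turn. The add and demo functions are illustrative names that do not come from the original article, and the short sleep only stands in for real I/O.

```python
import asyncio


async def add(a, b):
    """A coroutine function: calling add(1, 2) only creates a coroutine object."""
    await asyncio.sleep(0.01)  # hand control back to the event loop
    return a + b


async def demo():
    # 1. Coroutine: it starts running only when it is awaited.
    direct = await add(1, 2)

    # 2. Task: wraps a coroutine and schedules it on the event loop right away.
    task = asyncio.create_task(add(3, 4))
    from_task = await task

    # 3. Future: a low-level placeholder whose result is set later.
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    loop.call_soon(future.set_result, 42)
    from_future = await future

    return direct, from_task, from_future


if __name__ == "__main__":
    print(asyncio.run(demo()))  # prints (3, 7, 42)
```

In application code you will mostly deal with coroutines and tasks; futures tend to appear inside libraries.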

Here is a simple example that illustrates these concepts:

    import asyncio

    async def greet(name):
        await asyncio.sleep(1)  # Simulate an I/O operation
        print(f"Hello, {name}!")

    async def main():
        await asyncio.gather(
            greet("Alice"),
            greet("Bob"),
            greet("Charlie")
        )

    asyncio.run(main())

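In this example, we define an asynchronous function greet that simulates an I/O operation with asyncio.sleep(). The main function uses asyncio.gather() to run several greetings concurrently. Even though each greeting sleeps for one second, all three are printed after about one second in total, which demonstrates the benefit of asynchronous execution.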

The Need for Async in LLM API Calls

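When working with LLM APIs, we often face scenarios that require multiple API calls, either in sequence or in parallel. Traditional synchronous code can create significant performance bottlenecks, especially for high-latency operations such as network requests to LLM services.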

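Consider a scenario where we need to generate summaries for 100 different articles using an LLM API. With a synchronous approach, each API call blocks until its response arrives, so completing all requests can take several minutes. With an asynchronous approach, we can start many API calls concurrently and significantly reduce the total execution time.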
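
The difference can be measured without calling a real API. The sketch below is an illustration only: fake_summarize is a hypothetical stand-in that sleeps for 50 ms instead of contacting an LLM service, and the timings depend on that assumed delay.

```python
import asyncio
import time


async def fake_summarize(article_id, delay=0.05):
    """Stand-in for an LLM API call; the delay mimics network latency."""
    await asyncio.sleep(delay)
    return f"summary-{article_id}"


async def run_sequential(n):
    # Each call waits for the previous one to finish.
    return [await fake_summarize(i) for i in range(n)]


async def run_concurrent(n):
    # All calls are in flight at the same time.
    return await asyncio.gather(*(fake_summarize(i) for i in range(n)))


async def timed(coro):
    """Await a coroutine and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def compare(n=20):
    seq, seq_time = await timed(run_sequential(n))
    con, con_time = await timed(run_concurrent(n))
    return seq, seq_time, con, con_time


if __name__ == "__main__":
    seq, seq_time, con, con_time = asyncio.run(compare())
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run takes roughly n times the per-call delay, while the concurrent run takes roughly one delay, and both return the summaries in the same order.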

Setting Up Your Environment

To get started with asynchronous LLM API calls, set up a Python environment with the required libraries. You will need:

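Python 3.7 or higher (for native asyncio support)
aiohttp: an asynchronous HTTP client library
openai: the official OpenAI Python client (if you are using OpenAI's GPT models)
langchain: a framework for building applications with LLMs (optional, but recommended for complex workflows)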

You can install these dependencies using pip:

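    pip install aiohttp openai langchain

Basic Async LLM API Calls with asyncio and aiohttp

Let's start with a simple asynchronous call to an LLM API. We use OpenAI's GPT-3.5 API as the example, called through the AsyncOpenAI client from the openai package, but the concepts apply to other LLM APIs as well.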

    import asyncio
    from openai import AsyncOpenAI

    async def generate_text(prompt, client):
        # Send one chat completion request and return the generated text.
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    async def main():
        prompts = [
            "Explain quantum computing in simple terms.",
            "Write a haiku about artificial intelligence.",
            "Describe the process of photosynthesis."
        ]

        # AsyncOpenAI reads the API key from the OPENAI_API_KEY environment variable.
        async with AsyncOpenAI() as client:
            tasks = [generate_text(prompt, client) for prompt in prompts]
            results = await asyncio.gather(*tasks)

        for prompt, result in zip(prompts, results):
            print(f"Prompt: {prompt}\nResponse: {result}\n")

    asyncio.run(main())

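In this example, we define an asynchronous function generate_text that calls the OpenAI API through the AsyncOpenAI client. The main function creates one task per prompt and uses asyncio.gather() to run them concurrently.

This approach sends multiple requests to the LLM API at the same time, which significantly reduces the total time needed to process all prompts.

Advanced Techniques: Batching and Concurrency Control

The previous example demonstrates the basics of asynchronous LLM API calls, but real-world applications often require more sophisticated approaches. Let's look at two important techniques: request batching and concurrency control.

Request batching: When processing a large number of prompts, it is often more practical to process them in groups than to launch every request at once. In the following example, each prompt is still sent as its own API call, but only one batch of 10 is in flight at a time, which keeps the load on the API bounded.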

    import asyncio
    from openai import AsyncOpenAI

    async def process_batch(batch, client):
        # Run every request in this batch concurrently and wait for all of them.
        responses = await asyncio.gather(*[
            client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            ) for prompt in batch
        ])
        return [response.choices[0].message.content for response in responses]

    async def main():
        prompts = [f"Tell me a fact about number {i}" for i in range(100)]
        batch_size = 10

        async with AsyncOpenAI() as client:
            results = []
            # Batches run one after another; requests inside a batch run concurrently.
            for i in range(0, len(prompts), batch_size):
                batch = prompts[i:i+batch_size]
                batch_results = await process_batch(batch, client)
                results.extend(batch_results)

        for prompt, result in zip(prompts, results):
            print(f"Prompt: {prompt}\nResponse: {result}\n")

    asyncio.run(main())

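Concurrency control: Asynchronous programming allows concurrent execution, but it is important to control the level of concurrency so that you do not overwhelm the API server or exceed rate limits. asyncio.Semaphore can be used for this purpose.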

    import asyncio
    from openai import AsyncOpenAI

    async def generate_text(prompt, client, semaphore):
        # Only max_concurrent_requests coroutines can hold the semaphore at once.
        async with semaphore:
            response = await client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content

    async def main():
        prompts = [f"Tell me a fact about number {i}" for i in range(100)]
        max_concurrent_requests = 5
        semaphore = asyncio.Semaphore(max_concurrent_requests)

        async with AsyncOpenAI() as client:
            tasks = [generate_text(prompt, client, semaphore) for prompt in prompts]
            results = await asyncio.gather(*tasks)

        for prompt, result in zip(prompts, results):
            print(f"Prompt: {prompt}\nResponse: {result}\n")

    asyncio.run(main())

In this example, a semaphore limits the number of concurrent requests to 5, so the API server is not overloaded.
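
One way to check that a semaphore really caps concurrency is to count overlapping calls. The sketch below is an assumption-laden illustration: fake_llm_call sleeps instead of calling an API, and ConcurrencyTracker is a hypothetical helper written for this example.

```python
import asyncio


class ConcurrencyTracker:
    """Hypothetical helper that records how many calls overlap."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    def enter(self):
        self.active += 1
        self.peak = max(self.peak, self.active)

    def exit(self):
        self.active -= 1


async def fake_llm_call(i, semaphore, tracker):
    async with semaphore:  # at most `limit` coroutines pass this point at once
        tracker.enter()
        try:
            await asyncio.sleep(0.01)  # simulated network latency
            return i * 2
        finally:
            tracker.exit()


async def run(n=30, limit=5):
    semaphore = asyncio.Semaphore(limit)
    tracker = ConcurrencyTracker()
    results = await asyncio.gather(
        *(fake_llm_call(i, semaphore, tracker) for i in range(n))
    )
    return results, tracker.peak


if __name__ == "__main__":
    results, peak = asyncio.run(run())
    print(f"{len(results)} calls finished, peak concurrency: {peak}")
```

Unlike fixed batches, a semaphore starts the next request as soon as any slot frees up, so one slow call does not hold back the rest of its group.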
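
Error Handling and Retries in Async LLM Calls

When working with external APIs, it is important to implement robust error handling and retry mechanisms. Let's improve the code to handle common errors and retry with exponential backoff.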

    import asyncio
    from openai import AsyncOpenAI
    # tenacity is a separate package: pip install tenacity
    from tenacity import retry, stop_after_attempt, wait_exponential

    class APIError(Exception):
        pass

    # reraise=True makes tenacity re-raise the last APIError once the attempts
    # run out, instead of wrapping it in tenacity.RetryError.
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=10),
        reraise=True,
    )
    async def generate_text_with_retry(prompt, client):
        try:
            response = await client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Error occurred: {e}")
            raise APIError("Failed to generate text") from e

    async def process_prompt(prompt, client, semaphore):
        async with semaphore:
            try:
                result = await generate_text_with_retry(prompt, client)
                return prompt, result
            except APIError:
                return prompt, "Failed to generate response after multiple attempts."

    async def main():
        prompts = [f"Tell me a fact about number {i}" for i in range(20)]
        max_concurrent_requests = 5
        semaphore = asyncio.Semaphore(max_concurrent_requests)

        async with AsyncOpenAI() as client:
            tasks = [process_prompt(prompt, client, semaphore) for prompt in prompts]
            results = await asyncio.gather(*tasks)

        for prompt, result in results:
            print(f"Prompt: {prompt}\nResponse: {result}\n")

    asyncio.run(main())

This enhanced version includes:

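A custom APIError exception for API-related errors.
A generate_text_with_retry function decorated with @retry from the tenacity library, which implements exponential backoff.
Error handling in the process_prompt function to catch and report failures.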
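
For readers who want to see what exponential backoff does without a third-party library, here is a minimal standard-library sketch. It is not how tenacity is implemented. TransientError, retry_with_backoff, and make_flaky_call are hypothetical names introduced for this illustration, and the delays are kept tiny so the example runs quickly.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical error standing in for a rate-limit or network failure."""


async def retry_with_backoff(func, *, max_attempts=3, base_delay=0.01, max_delay=1.0):
    """Call the coroutine function `func`, retrying on TransientError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay))


def make_flaky_call(failures):
    """Return a coroutine function that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated failure")
        return f"ok after {state['calls']} calls"

    return call


if __name__ == "__main__":
    print(asyncio.run(retry_with_backoff(make_flaky_call(2))))  # prints ok after 3 calls
```

In a real client you would likely retry only on errors that are worth retrying, such as rate limits and timeouts, and let authentication or validation errors fail immediately.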

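Optimizing Performance: Streaming Responses

For long-form content generation, streaming responses can significantly improve the perceived performance of your application. Instead of waiting for the complete response, you can process and display chunks of text as they become available.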

    import asyncio
    from openai import AsyncOpenAI

    async def stream_text(prompt, client):
        stream = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )

        full_response = ""
        async for chunk in stream:
            # Skip chunks that carry no choices or no text.
            if chunk.choices and chunk.choices[0].delta.content is not None:
                content = chunk.choices[0].delta.content
                full_response += content
                print(content, end='', flush=True)

        print("\n")
        return full_response

    async def main():
        prompt = "Write a short story about a time-traveling scientist."

        async with AsyncOpenAI() as client:
            result = await stream_text(prompt, client)

        print(f"Full response:\n{result}")

    asyncio.run(main())

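This example shows how to stream a response from the API and print each chunk as it arrives. This approach is especially useful for chat applications or any scenario where you want to give users real-time feedback.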
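
The consuming side of a stream can be tried out offline with an async generator. In this sketch, fake_stream is a hypothetical stand-in for a streamed LLM response and consume is an illustrative helper; neither comes from the OpenAI client.

```python
import asyncio


async def fake_stream(text, chunk_size=5, delay=0.005):
    """Async generator standing in for a streamed LLM response."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(delay)  # simulated gap between chunks
        yield text[start:start + chunk_size]


async def consume(stream, on_chunk=None):
    """Collect chunks into the full response, handling each one as it arrives."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. push the chunk to a UI or a websocket
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    full = asyncio.run(
        consume(
            fake_stream("The quick brown fox"),
            on_chunk=lambda c: print(c, end="", flush=True),
        )
    )
    print(f"\nFull response: {full}")
```

Keeping the per-chunk handling in a callback makes it easy to swap printing for whatever the application needs, while the full text is still returned at the end.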
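
Building Async Workflows with LangChain

For more complex LLM-powered applications, the LangChain framework provides high-level abstractions that simplify chaining multiple LLM calls and integrating other tools. Let's look at an example that uses LangChain with its asynchronous capabilities.

The following example shows how LangChain can be used to build a more complex workflow with streaming and asynchronous execution. AsyncCallbackManager and StreamingStdOutCallbackHandler enable real-time streaming of the generated content.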

    import asyncio
    # These import paths match the LangChain release the article was written for;
    # newer releases may have moved or deprecated them.
    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.callbacks.manager import AsyncCallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

    async def generate_story(topic):
        llm = OpenAI(
            temperature=0.7,
            streaming=True,
            callback_manager=AsyncCallbackManager([StreamingStdOutCallbackHandler()])
        )
        prompt = PromptTemplate(
            input_variables=["topic"],
            template="Write a short story about {topic}."
        )
        chain = LLMChain(llm=llm, prompt=prompt)
        return await chain.arun(topic=topic)

    async def main():
        topics = ["a magical forest", "a futuristic city", "an underwater civilization"]
        tasks = [generate_story(topic) for topic in topics]
        stories = await asyncio.gather(*tasks)

        for topic, story in zip(topics, stories):
            print(f"\nTopic: {topic}\nStory: {story}\n{'='*50}\n")

    asyncio.run(main())

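Serving Async LLM Applications with FastAPI

To make your asynchronous LLM application available as a web service, FastAPI is a good choice because of its native support for asynchronous operations. Here is an example of a simple API endpoint for text generation: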

    import asyncio

    from fastapi import FastAPI, BackgroundTasks
    from pydantic import BaseModel
    from openai import AsyncOpenAI

    app = FastAPI()
    client = AsyncOpenAI()

    class GenerationRequest(BaseModel):
        prompt: str

    class GenerationResponse(BaseModel):
        generated_text: str

    @app.post("/generate", response_model=GenerationResponse)
    async def generate_text(request: GenerationRequest, background_tasks: BackgroundTasks):
        response = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": request.prompt}]
        )
        generated_text = response.choices[0].message.content

        # Simulate some post-processing in the background
        background_tasks.add_task(log_generation, request.prompt, generated_text)

        return GenerationResponse(generated_text=generated_text)

    async def log_generation(prompt: str, generated_text: str):
        # Simulate logging or additional processing
        await asyncio.sleep(2)
        print(f"Logged: Prompt '{prompt}' generated text of length {len(generated_text)}")

    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)

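This FastAPI application creates a /generate endpoint that accepts a prompt and returns the generated text. It also shows how to use a background task for additional processing without blocking the response.

Best Practices and Common Pitfalls

When working with asynchronous LLM APIs, keep the following best practices in mind: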

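Use connection pooling: when making multiple requests, reuse connections to reduce overhead.
Implement proper error handling: always be prepared for network problems, API errors, and unexpected responses.
Respect rate limits: use semaphores or other concurrency control mechanisms to avoid overloading the API.
Monitor and log: implement comprehensive logging to track performance and identify problems.
Use streaming for long-form content: it improves the user experience and allows early processing of partial results.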
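
As one illustration of the error-handling advice above, the sketch below combines a per-call timeout with asyncio.gather(return_exceptions=True), so that a single failed or slow call does not discard the other results. fake_llm_call and its simulated failure are assumptions made for the example, not part of any real API.

```python
import asyncio


async def fake_llm_call(prompt, delay):
    """Stand-in for an API call that may be slow or may fail."""
    await asyncio.sleep(delay)
    if "bad" in prompt:
        raise ValueError(f"simulated API error for {prompt!r}")
    return f"response to {prompt!r}"


async def call_with_timeout(prompt, delay, timeout):
    # wait_for cancels the call and raises asyncio.TimeoutError if it runs too long.
    return await asyncio.wait_for(fake_llm_call(prompt, delay), timeout=timeout)


async def run_all(jobs, timeout=0.05):
    # return_exceptions=True returns exceptions as results instead of raising
    # the first one and losing the rest.
    outcomes = await asyncio.gather(
        *(call_with_timeout(prompt, delay, timeout) for prompt, delay in jobs),
        return_exceptions=True,
    )
    succeeded = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [o for o in outcomes if isinstance(o, BaseException)]
    return succeeded, failed


if __name__ == "__main__":
    jobs = [("fast", 0.001), ("bad", 0.001), ("slow", 0.5)]
    succeeded, failed = asyncio.run(run_all(jobs))
    print("succeeded:", succeeded)
    print("failed:", [type(e).__name__ for e in failed])
```

Separating successes from failures after the gather makes it straightforward to log the failures and retry only those prompts.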

The post Asynchronous LLM API Calls in Python: A Comprehensive Guide appeared first on Unite.AI.

