Live training | 2023-06-13 | Building AI Applications with LangChain and GPT

    You've probably talked to ChatGPT using the web interface, or used the API with the openai python package and wondered "what if I could teach it about my own data?". Today we're going to build such an application using LangChain, a framework for developing applications powered by language models.

    In today's session, we'll build a chatbot powered by GPT-3.5 that can answer questions about LangChain, as it will have knowledge of the LangChain documentation. We'll cover:

    • Getting set up with an OpenAI developer account and integrating it with Workspace
    • Installing the LangChain package
    • Preparing the data
    • Embedding the data using OpenAI's Embeddings API, and getting a cost estimate for this operation
    • Storing the data in a vector database
    • Querying the vector database
    • Putting together a basic chat application to "talk to the LangChain docs"


    Before you begin

    Unzip the required data by running the following cell

    !test -f contents.zip && unzip contents.zip && rm contents.zip

    Create a developer account with OpenAI

    1. Go to the API signup page.

    2. Create your account (you'll need to provide your email address and your phone number).

    3. Go to the API keys page.

    4. Create a new secret key.

    5. Take a copy of it. (If you lose it, delete the key and create a new one.)

    Add a payment method

    OpenAI sometimes provides free credits for the API, but it's not clear if that is worldwide or what the conditions are. You may need to add debit/credit card details.

    We will use 2 APIs:

    • The Chat API with the gpt-3.5-turbo model (cost: $0.002 / 1K tokens)
    • The Embeddings API with the Ada v2 model (cost: $0.0004 / 1K tokens)

    In total, the Chat API (used for completions) should cost less than $0.10, and embedding should cost around $0.60. This notebook provides the embeddings already, so you can skip the embedding step.

    1. Go to the Payment Methods page.

    2. Click Add payment method.

    3. Fill in your card details.

    Set up Environment Variables

    1. In Workspace, click on Environment.
    2. Click on the "Environment Variables" plus button.
    3. In the "Name" field, type OPENAI_API_KEY. In the "Value" field, paste in your secret key (starting with sk-).
    4. Click "Create", and connect the new integration.

    Task 0: Setup

    For the purpose of this training, we'll need to install a few packages:

    • langchain: The LangChain framework
    • chromadb: The package we'll use for the vector database
    • tiktoken: A tokenizer we'll use to count GPT-3 tokens
    # install langchain (version 0.0.191)
    !pip install langchain==0.0.191
    # install chromadb
    !pip install chromadb
    # install tiktoken
    !pip install tiktoken
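
    If you'd like to double-check that the pinned versions were picked up, an optional way is to print the installed versions (this sketch only uses the standard library and isn't needed for the rest of the notebook):

    # Optional: print the installed package versions
    from importlib.metadata import version
    
    for package in ("langchain", "chromadb", "tiktoken"):
        print(package, version(package))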
    

    Task 1: Load data

    To be able to embed and store data, we need to provide LangChain with Documents. This is easy to achieve thanks to Document Loaders. In our case, we're working with "Read the Docs" documentation, for which there is a dedicated loader: ReadTheDocsLoader. In the rtdocs folder, you'll find all the HTML files from the LangChain documentation (https://python.langchain.com/en/latest/index.html).

    How did we obtain the data

    These files were downloaded by executing this Linux command:

    wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/
    

    We urge you **NOT** to execute this during the live training, as it will crawl and download the full LangChain documentation site (~1000 files). This operation is heavy and could disrupt the site, especially if hundreds of learners run it all at once!



    Our first task is to load these HTML files as documents that we can use with LangChain, using the ReadTheDocsLoader. It reads the directory containing the HTML files, strips the HTML tags from each file to keep only the text, and returns the text of each file as a Document. At the end of this task, we'll have a variable raw_documents containing a list of Document objects: one Document per HTML file.

    Note that in this step we won't actually load the documents into a database; we're simply loading them into a list.

    Instructions

    1. Import ReadTheDocsLoader from langchain.document_loaders
    2. Create the loader, pointing to the rtdocs/python.langchain.com/en/latest directory and enabling the HTML parser feature with features='html.parser'
    3. Load the data in raw_documents by calling loader.load()
    # Import ReadTheDocsLoader
    from langchain.document_loaders import ReadTheDocsLoader
    
    # Create a loader for the `rtdocs/python.langchain.com/en/latest` folder
    loader = ReadTheDocsLoader("rtdocs/python.langchain.com/en/latest", features="html.parser")
    
    # Load the data
    raw_documents = loader.load()
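
    Before moving on, it can help to check what the loader produced. The optional sketch below prints the number of documents and previews the first one; page_content (the extracted text) and metadata (including the source file path) are the standard fields of a LangChain Document.

    # Optional: inspect what the loader produced
    print(f"Loaded {len(raw_documents)} documents")
    
    # Each Document holds the extracted text in `page_content`
    # and the source file path in `metadata`
    print(raw_documents[0].metadata)
    print(raw_documents[0].page_content[:500])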

    Task 2: Slice the documents into smaller chunks

    In the previous step, we turned each HTML file into a Document. These files may be very long, and are potentially too large to embed fully. It's also a good practice to avoid embedding large documents:

    • long documents often contain several concepts. Retrieval will be easier if each concept is indexed separately;
    • retrieved documents will be injected in a prompt, so keeping them short will keep the prompt small(ish)

    LangChain has a collection of tools to do this: Text Splitters. In our case, we'll use the simplest and most straightforward one: the Recursive Character Text Splitter. It recursively breaks the input down, splitting by paragraphs, then by lines, then by words as needed, until each chunk is small enough.
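
    To get a feel for this behaviour before applying it to the real documents, here is a tiny, optional illustration. The sample text and the very small chunk_size are made up purely for demonstration; they aren't part of the exercise.

    # A quick illustration of the recursive splitter on a short string.
    # The tiny chunk_size is for demonstration only; we'll use 1000 for the real documents.
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    demo_text = (
        "LangChain provides document loaders.\n\n"
        "It also provides text splitters that cut long documents into overlapping chunks."
    )
    demo_splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
    for chunk in demo_splitter.split_text(demo_text):
        print(repr(chunk))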

    Instructions

    1. Import the RecursiveCharacterTextSplitter from langchain.text_splitter
    2. Create a text splitter configured with chunk_size=1000 and chunk_overlap=200
      These values are arbitrary and you'll need to try different ones to see which best serve your use case
    3. Split the raw_documents and store them as documents, using the .split_documents() method
    # Import RecursiveCharacterTextSplitter
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Create the text splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    
    # Split the documents
    documents = splitter.split_documents(raw_documents)
    documents[0]
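
    As a final check, and to make the cost estimate mentioned earlier concrete, we can count the tokens in our chunks with tiktoken and multiply by the embedding price quoted above. This is a rough sketch: it assumes the cl100k_base encoding (the one used by OpenAI's second-generation embedding models) and the $0.0004 / 1K tokens figure listed earlier, which may change.

    # Rough embedding cost estimate: count the tokens in every chunk with tiktoken
    # and multiply by the Ada v2 price quoted above ($0.0004 per 1K tokens).
    import tiktoken
    
    encoding = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(encoding.encode(doc.page_content)) for doc in documents)
    
    print(f"Number of chunks: {len(documents)}")
    print(f"Total tokens: {total_tokens}")
    print(f"Estimated embedding cost: ${total_tokens / 1000 * 0.0004:.2f}")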