Step by Step Guide to Generate Podcasts using TTS and LLMs

Mathavan

What is an AI-Generated Podcast?

AI-generated podcasts represent a fusion of artificial intelligence technologies with traditional audio content creation, fundamentally transforming how podcasts are produced, distributed, and consumed. These podcasts use machine learning models, especially large language models (LLMs), to synthesize audio content that can range from entirely scripted to dynamically generated conversations. AI can simulate voices, personalities, and discussion styles of specific individuals, whether they are real-life figures, historical personalities, or fictional characters.

Advanced text-to-speech (TTS) technologies and voice cloning tools are used to generate lifelike audio from text scripts. These tools can mimic specific voice characteristics, making it possible to produce episodes that sound as though they are hosted by any personality the user chooses, discussing any topic the user selects. This customization helps engage listeners in real time, encouraging a deeper sense of connection and community around the podcast.

AI's Impact on Podcast Content Creation

One of the most labor-intensive aspects of podcast production is the research and scriptwriting phase, which involves gathering information, fact-checking, and crafting a narrative that is both informative and engaging. AI can handle these tasks efficiently by quickly processing vast amounts of data and generating content that adheres to a given outline or set of topics.

AI can scour multiple data sources, including recent news articles, academic papers, and other relevant materials, to compile detailed background information on chosen topics. This capability is particularly useful for podcasts that cover complex subjects requiring extensive background research, such as scientific developments, historical events, or current affairs. By automating the initial research and draft creation, AI allows podcast creators to focus more on refining the content, adding personal insights, and engaging with their audience.

Why AI Podcast Generation?

User-Defined Guests

AI allows for the creation of virtual episodes where listeners can choose their preferred personalities to feature, be they renowned historical figures or modern celebrities. This personalization enables fans to hear from figures who are no longer accessible, like past leaders, or from current figures who are otherwise unreachable due to their busy schedules. This flexibility enhances listener engagement by making the experience more tailored and interactive.

Exploring Unanswered Questions

AI facilitates conversations on topics that are either too niche for mainstream podcasts or require expertise that would be difficult to assemble in a traditional setting. For example, an AI could simulate a discussion between Nikola Tesla and Elon Musk on energy futures, a conversation impossible in reality but rich with educational and entertainment value. Note that the quality of the generated voice depends heavily on the quality of the training audio available for a given person, so this must be taken into account when choosing the speaker and guest.

Simulating Controversial Conversations

Many topics are too sensitive or divisive for real people to tackle in a public format due to potential backlash or political correctness. AI can navigate these issues by simulating dialogues that explore these areas without personal risk to the speakers. For example, an AI-generated podcast could simulate a conversation where U.S. President Joe Biden discusses his regrets and reflections on the consequences of military actions in Rafah, particularly addressing the loss of civilian lives during recent conflicts in Palestine. Such a dialogue, while entirely hypothetical, would allow for a nuanced exploration of the moral and strategic complexities involved, which might be too controversial for a sitting president to address openly. This not only broadens the scope of discussion but also allows for a deeper exploration of complex themes without real-world repercussions, providing listeners with insights into critical global issues that are rarely discussed in such an open and personal manner.

Bringing Fiction to Life

AI can create podcasts where fictional characters from beloved books or films interact in new scenarios, extending the universe of a story beyond its original media. For example, imagine a podcast episode where Naruto Uzumaki from "Naruto" and Izuku Midoriya from "My Hero Academia" discuss the nature of heroism and the responsibilities that come with power. Such interactions allow fans to enjoy fresh content with familiar characters, enhancing their connection to the narrative and exploring 'what if' scenarios that fuel fan theories and discussions. This unique content generation leverages AI to delve into creative discussions that would not be possible in the original works, offering new insights and entertainment to the audience.

Legacy Voices Preservation

Through voice cloning technology, AI can preserve and replicate the voices of deceased or aging personalities, making it possible for them to continue 'participating' in new dialogues. This not only keeps their legacies alive but also makes historical education more engaging by allowing students to 'hear' history from the very individuals who made it. For example, voice cloning could bring back the voice of Martin Luther King Jr., the prominent civil rights leader remembered for his powerful speeches.

How to Generate AI Podcasts

Knowledge Base Creation

The process begins by identifying and downloading the audio of YouTube videos. This is achieved using the youtube_dl library, a command-line program for downloading videos from YouTube and a few other sites. The get_channel_id_by_username and get_youtube_videos functions use the YouTube Data API v3, accessed via RapidAPI's interface: users input the name of a YouTube channel, and the API returns videos matching specified criteria such as minimum duration and video count. The download_audio function then uses youtube_dl to download the best audio quality available and saves it as an MP3 file in a specified output directory, which is crucial for obtaining clear audio for transcription.

Code-


import os
from transcription_utils import get_channel_id_by_username, download_audio, get_video_details, get_video_urls, get_youtube_videos, process_audio_files

def main():
    # Get user inputs interactively
    channel_name = input("Enter the name of the channel: ")
    min_minutes = int(input("Enter the minimum minutes of the videos: "))
    count = int(input("Enter the number of videos: "))
    output_folder = input("Enter the output folder name where audio files will be saved: ")

    channel_id = get_channel_id_by_username(channel_name)
    print("The channel id:", channel_id)

    videos = get_youtube_videos(channel_id, min_minutes, count)
    print("The videos are: ", videos)
    urls = get_video_urls(videos)

    # Ensure the output folder is created at the project root level
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    output_path = os.path.join(project_root, output_folder)
    os.makedirs(output_path, exist_ok=True)

    for url in urls:
        download_audio(url, output_folder=output_path)

    # Create a transcriptions directory and ensure each transcription is saved in its own folder
    # Use 'Transcriptions' consistently so the vector store setup script can find this folder
    transcriptions_root = os.path.join(project_root, 'Transcriptions')
    os.makedirs(transcriptions_root, exist_ok=True)

    transcription_folder = f'transcriptions_of_{output_folder}'
    transcription_path = os.path.join(transcriptions_root, transcription_folder)
    os.makedirs(transcription_path, exist_ok=True)
    
    process_audio_files(output_path, transcription_path)
    print("Transcription generated successfully")

if __name__ == "__main__":
    main()

Some newly listed YouTube videos or channels are not found via the YouTube Data API v3 calls; the script below therefore lets the user download a particular video in MP3 format from a URL they provide.


import argparse
import os
from transcription_utils import download_audio, process_audio_files

def main(urls, output_folder):
    # Ensure the output folder is created at the project root level
    project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    output_path = os.path.join(project_root, output_folder)
    os.makedirs(output_path, exist_ok=True)

    for url in urls:
        download_audio(url.strip(), output_folder=output_path)  # Ensure each URL is stripped of whitespace

    # Create a transcriptions directory and ensure each transcription is saved in its own folder
    transcriptions_root = os.path.join(project_root, 'Transcriptions')
    os.makedirs(transcriptions_root, exist_ok=True)

    transcription_folder = f'transcriptions_of_{output_folder}'
    transcription_path = os.path.join(transcriptions_root, transcription_folder)
    os.makedirs(transcription_path, exist_ok=True)
    
    process_audio_files(output_path, transcription_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Download and process audio files from YouTube URLs.')
    parser.add_argument('-o', '--output', required=True, help='The output folder name where audio files will be saved')
    parser.add_argument('--urls', required=True, help='A comma-separated list of YouTube URLs to download and process')
    args = parser.parse_args()

    # Split the URLs by comma and strip any extra whitespace
    urls = [url.strip() for url in args.urls.split(',')]

    main(urls, args.output)

Dialogue Generation

Producing dialogues that are both natural-sounding and engaging with advanced LLMs involves a specific set of steps and technologies. Below, I'll outline how the provided code achieves dialogue generation, detailing the role of these models and their customization to reflect distinct speech patterns and personality traits.

Text Splitting and Data Preparation

In this stage, the audio transcripts received from AssemblyAI are loaded, and their text is retrieved to build a vector store that helps the LLM inherit the speaker's personality from the knowledge base.

Before generating the vector store, the text data (such as transcripts) must be properly formatted. The get_text_chunks function is designed to read text files from the transcript directory, combine their contents for a particular speaker file into a single string, and then split this string into smaller chunks for further processing. This step is crucial for processing large datasets without overwhelming the model.

Code-
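
The get_text_chunks implementation is not reproduced here, but a minimal sketch of it might look like the following, assuming plain-text transcripts and a simple fixed-size splitter with overlap; the chunking strategy and parameter values are illustrative assumptions, not the project's exact code:


import os

def get_text_chunks(transcripts_dir, chunk_size=1000, overlap=200):
    # Combine all transcript files for one speaker into a single string
    # (chunk_size and overlap are assumed values for this sketch)
    combined_text = ""
    for filename in sorted(os.listdir(transcripts_dir)):
        if filename.endswith(".txt"):
            file_path = os.path.join(transcripts_dir, filename)
            with open(file_path, "r", encoding="utf-8") as f:
                combined_text += f.read() + "\n"

    # Split the combined text into overlapping fixed-size chunks so that
    # no single chunk overwhelms the embedding model
    chunks = []
    start = 0
    while start < len(combined_text):
        chunks.append(combined_text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks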

Embedding and Index Creation

Using the SentenceTransformer model, text chunks are converted into vector embeddings. These embeddings represent the textual data in a high-dimensional space, capturing semantic meanings that are used for generating dialogue. Chroma DB is utilized to efficiently store and retrieve these embeddings during the dialogue generation process, allowing the model to access relevant context quickly.


def get_vectorstore(text_chunks, collection_name):
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    embeddings = model.encode(text_chunks, convert_to_tensor=False, show_progress_bar=True)
    
    # Initialize Chroma client with PersistentClient
    chroma_client = chromadb.PersistentClient(path="./chroma")

    # Create a collection in Chroma
    collection = chroma_client.create_collection(collection_name)

    # Add the text chunks and their embeddings to the Chroma collection
    for i, chunk in enumerate(text_chunks):
        collection.add(
            documents=[chunk],
            metadatas=[{"source": f"chunk_{i}"}],
            ids=[str(i)],
            embeddings=[embeddings[i].tolist()]
        )
    
    return collection, model

import os
from sentence_transformers import SentenceTransformer
import chromadb
from podcast_utils import get_text_chunks, get_vectorstore
from dotenv import load_dotenv

load_dotenv()

def setup_chroma_directory(chroma_directory, transcriptions_folder):
    # Read guest and speaker names from environment variables
    guest_name = os.getenv('GUEST_NAME')
    speaker_name = os.getenv('SPEAKER_NAME')

    if not guest_name or not speaker_name:
        raise ValueError("Guest and Speaker names must be set in the environment variables.")

    # Define the paths for the guest and speaker transcriptions
    transcripts_location_guest = os.path.join(transcriptions_folder, f"transcriptions_of_{guest_name}")
    transcripts_location_speaker = os.path.join(transcriptions_folder, f"transcriptions_of_{speaker_name}")

    print(f"Guest transcription path: {transcripts_location_guest}")
    print(f"Speaker transcription path: {transcripts_location_speaker}")

    if not os.path.exists(transcripts_location_guest) or not os.path.exists(transcripts_location_speaker):
        print("Contents of the Transcriptions folder:")
        for item in os.listdir(transcriptions_folder):
            print(item)
        raise ValueError(f"Transcription directories for {guest_name} and {speaker_name} must exist.")

    if not os.path.exists(chroma_directory):
        # Get text chunks from transcripts for both LLMs
        text_chunks_guest = get_text_chunks(transcripts_location_guest)
        text_chunks_speaker = get_text_chunks(transcripts_location_speaker)

        # Create vector stores for both LLMs with unique collection names
        print("\n \n Creating vector stores for GUEST")
        collection_guest, model_guest = get_vectorstore(text_chunks_guest, "podcast_conversations_guest")
        print("\n \n Creating vector stores for SPEAKER")
        collection_speaker, model_speaker = get_vectorstore(text_chunks_speaker, "podcast_conversations_speaker")

        # Model assignment
        model = model_guest  # Assuming both models are the same
    else:
        # Initialize Chroma client with PersistentClient
        chroma_client = chromadb.PersistentClient(path=chroma_directory)
        
        collection_guest = chroma_client.get_collection("podcast_conversations_guest")
        collection_speaker = chroma_client.get_collection("podcast_conversations_speaker")
        
        model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    return collection_guest, collection_speaker, model

if __name__ == "__main__":
    # Get the script's directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    transcriptions_folder = os.path.join(script_dir, 'Transcriptions')
    chroma_directory = os.path.join(script_dir, 'chroma')

    collection_guest, collection_speaker, model = setup_chroma_directory(
        chroma_directory=chroma_directory,
        transcriptions_folder=transcriptions_folder
    )

    print("Vector store built successfully.")

Enhancing Conversations with Personalized Prompts

The use of customized prompts is pivotal in guiding a language model to mimic specific personalities effectively, enhancing the quality and relevance of AI-generated responses. As depicted in the diagram, the Speaker_prompt, Guest_prompt, Personality_guide_speaker, and Personality_guide_guest are tailored to fine-tune the model's conversational style and characteristics for different personas. These prompts provide structured guidance, allowing the model to generate responses that are contextually appropriate and align with the predefined traits of the speaker and guest.

By customizing these prompts, users can ensure that the AI maintains a consistent and engaging dialogue, reflecting the nuances of the specified personalities. This approach not only improves the relatability and authenticity of the interaction but also significantly enhances the overall user experience by delivering responses that feel more natural and human-like. The master and judge prompts, by contrast, provide the basic outline of how the models should generate these personalities, so customizing them is not strictly necessary.
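
To make this concrete, here is a hypothetical sketch of how a speaker master prompt could be assembled from the topic, the environment variables, and a personality guide file; the function body, file path, and template wording are illustrative assumptions rather than the project's actual prompts:


import os

def get_speaker_master_prompt(topic):
    # Illustrative template only; the real prompts live in editable text
    # files, and the names come from the SPEAKER_NAME/GUEST_NAME env vars
    speaker_name = os.getenv("SPEAKER_NAME")
    guest_name = os.getenv("GUEST_NAME")
    with open("prompts/personality_guide_speaker.txt") as f:  # assumed path
        personality_guide = f.read()
    return f"""You are {speaker_name}, hosting a podcast episode with {guest_name}.
The topic of this episode is: {topic}

Personality guide for {speaker_name}:
{personality_guide}

Stay in character at all times, ask natural follow-up questions, and keep
each reply short enough to be spoken in under a minute."""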

Generating Dialogue

In the process of generating dialogue for AI-driven podcasts, several steps and components work together to create a coherent and engaging conversation. The script begins by loading necessary configurations and personality guides, which are crucial for shaping the responses to match the defined personas of the speaker and guest. These guides are read from text files and environment variables, ensuring that each interaction is personalized and contextually appropriate.

The dialogue generation starts with the creation of master prompts for both the speaker and guest using the `get_speaker_master_prompt` and `get_guest_master_prompt` functions, which incorporate the discussion topic provided by the user. The total number of messages for the conversation is set, and initial message arrays for both the speaker and guest are established.

During each iteration of the conversation loop, the script first checks and redacts older messages to maintain a manageable history size. It then generates responses using the `get_llm_response` function, which interacts with the LLM through the OpenAI API. The responses are processed to extract "think" and "out" texts using the `extract_out_think_texts` and `extract_out_text` functions, respectively.

These responses are then embedded using a sentence transformer model, and relevant personality information is retrieved from pre-built vector stores (`collection_guest` and `collection_speaker`). This information is used to adjust the prompts dynamically, ensuring the dialogue remains contextually relevant and true to the defined personas. The `judge_response` function further refines the responses by evaluating them against a set of predefined criteria and adjusting them for coherence and personality consistency. The cleaned and judged responses are appended to the conversation log, which captures raw responses, think texts, and the final output for each message. Finally, the complete conversation log is saved to a JSON file with a timestamped filename.
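
Putting these steps together, the core of the conversation loop might look roughly like the sketch below. The helper functions are the ones named above, but the module they are imported from, their exact signatures, and the retrieval details are assumptions made for illustration:


import json
from datetime import datetime

# Helpers described above; the module name and signatures are assumed for this sketch
from podcast_utils import (get_speaker_master_prompt, get_guest_master_prompt,
                           get_llm_response, extract_out_think_texts,
                           judge_response, redact_old_messages)

def run_conversation(topic, collection_guest, collection_speaker, model, total_messages=10):
    speaker_messages = [{"role": "system", "content": get_speaker_master_prompt(topic)}]
    guest_messages = [{"role": "system", "content": get_guest_master_prompt(topic)}]
    conversation_log = []

    for turn in range(total_messages):
        # Keep the history to a manageable size before each generation
        speaker_messages = redact_old_messages(speaker_messages)
        guest_messages = redact_old_messages(guest_messages)

        # Speaker and guest alternate turns
        is_speaker_turn = (turn % 2 == 0)
        messages = speaker_messages if is_speaker_turn else guest_messages
        collection = collection_speaker if is_speaker_turn else collection_guest

        # Generate a raw response and separate the reasoning from the output
        raw_response = get_llm_response(messages)
        think_text, out_text = extract_out_think_texts(raw_response)

        # Embed the response and pull relevant personality context from Chroma
        embedding = model.encode([out_text])[0].tolist()
        results = collection.query(query_embeddings=[embedding], n_results=3)
        context = " ".join(results["documents"][0])

        # Judge pass: refine the response for coherence and personality consistency
        final_text = judge_response(out_text, context)

        role = "speaker" if is_speaker_turn else "guest"
        conversation_log.append({role: final_text, "think": think_text, "raw": raw_response})

        # Each side sees the other's reply as an incoming user message
        speaker_messages.append({"role": "assistant" if is_speaker_turn else "user", "content": final_text})
        guest_messages.append({"role": "user" if is_speaker_turn else "assistant", "content": final_text})

    # Save the full log with a timestamped filename
    filename = f"conversation_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, "w") as log_file:
        json.dump(conversation_log, log_file, indent=2)
    return conversation_log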

Eleven Labs Voice Cloning for Podcast Production

The process of using Eleven Labs for voice cloning in podcast production involves several steps, from setting up the API to integrating custom voice settings for generating dynamic audio content. This technology allows podcast creators to use realistic synthesized voices that can be tailored to specific characters or personalities, enhancing the auditory experience of podcasts.

Initialization and API Setup

First, the Eleven Labs API client is initialized using an API key. This client facilitates all interactions with the Eleven Labs services, including retrieving available voices and generating audio. The API also lets you retrieve a list of available voices, which can be used to select a voice that matches the character or personality you wish to clone.


from elevenlabs.client import ElevenLabs

client = ElevenLabs(
    api_key="",  # your Eleven Labs API key (or load it from an environment variable)
)

response = client.voices.get_all()
print(response.voices)  # Displays all available voices

Build Custom Voice

Eleven Labs provides cloning functionality starting from the Starter tier of its subscription, where the user can clone a target voice using a few clear-cut audio samples. You can access this feature and run a demo directly in the Voice Lab section of their website or via code.


from elevenlabs import play  # needed for the play() call below

voice = client.clone(
    name="Alex",
    description="An old American male voice with a slight hoarseness in his throat. Perfect for news",  # Optional
    files=["./sample_0.mp3", "./sample_1.mp3", "./sample_2.mp3"],
)

audio = client.generate(text="Hi! I'm a cloned voice!", voice=voice)

play(audio)

Find the Best Fit Parameters

When utilizing voice cloning technology for podcast production, particularly with Eleven Labs, it's essential to adjust the voice settings to closely match the desired vocal characteristics of the characters being emulated. This ensures that the generated audio maintains a high degree of realism and fidelity to the original voice. Configuring the correct parameters in VoiceSettings is crucial for achieving the best quality and most authentic-sounding cloned voice. Each parameter plays a specific role in how the voice will sound.

stability: This parameter controls the consistency of the voice output. A higher stability value ensures that the voice does not fluctuate too much, maintaining a consistent tone throughout the dialogue. For example, setting stability to 0.60 provides a balance between natural variation and consistency.

similarity_boost: This setting adjusts how closely the cloned voice matches the target voice. A higher similarity_boost increases the likelihood that the voice sounds like the intended person. For voice 1, a similarity boost of 0.95 suggests a strong resemblance to the original voice, whereas voice 2 setting at 0.80 allows for slightly more deviation, which might be useful for character voices where exact replication isn't critical.

style: This parameter influences the dynamic range and expressiveness of the voice. A lower style value might make the voice sound more monotone, while a higher value increases expressiveness. At 0.20, the style is moderately expressive, suitable for general podcast dialogues where too much expressiveness could distract from the content.

use_speaker_boost: When set to True, this enhances the clarity and presence of the voice, making it stand out more clearly in the audio mix. This is particularly useful in podcast production, where voice clarity is paramount for listener engagement, but note that it also consumes your account credits noticeably faster.


from elevenlabs import Voice, VoiceSettings

Voice_1_voice_settings = VoiceSettings(
    stability=0.60,
    similarity_boost=0.95,
    style=0.20,
    use_speaker_boost=True
)

Voice_2_voice_settings = VoiceSettings(
    stability=0.60,
    similarity_boost=0.80,
    style=0.20,
    use_speaker_boost=True
)

Generating and Combining Audio Segments for a Podcast

The process of creating a podcast episode using AI-generated voices involves two main steps: generating individual audio segments for each part of the conversation and then combining these segments into a single coherent audio file. First, the audio for each segment of the podcast is generated using the Eleven Labs API. The code facilitates this by loading a conversation log, applying the appropriate voice settings for the speaker and guest, and generating audio segments accordingly. The get_voice function handles the generation: it initializes a combined audio segment and iteratively adds audio for each dialogue entry. Voice settings such as stability, similarity boost, and style are tailored for the speaker and guest to ensure distinct and realistic character voices. Each generated audio segment is combined sequentially, with random delays added to simulate natural conversation pauses. Finally, the combined audio file is saved with a timestamped filename in a structured directory, keeping output organized and easy to access for podcast production.


from elevenlabs import play, Voice, VoiceSettings
from elevenlabs.client import ElevenLabs
from pydub import AudioSegment
import io
import random
import os
from dotenv import load_dotenv
from datetime import datetime

# Load environment variables
load_dotenv()

client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY"))

# Initialize an empty audio segment
combined_audio = AudioSegment.empty()

# Define voice settings 
speaker_voice_settings = VoiceSettings(
    stability=0.45,
    similarity_boost=0.75,
    style=0.2,
    # use_speaker_boost=True
)

# Define voice settings 
guest_voice_settings = VoiceSettings(
    stability=0.5,
    similarity_boost=0.60,
    style=0.35,
    # use_speaker_boost=True
)

def get_voice(conversations, speaker_voice_id, guest_voice_id, speaker_name, guest_name):
    combined_audio = AudioSegment.empty()
    for entry in conversations:
        if 'speaker' in entry:
            voice_id = speaker_voice_id
            settings = speaker_voice_settings
            text = entry['speaker']
            speaker = speaker_name
        elif 'guest' in entry:
            voice_id = guest_voice_id
            settings = guest_voice_settings
            text = entry['guest']
            speaker = guest_name
        else:
            continue  # Skip if speaker is not recognized

        # Generate audio with custom voice settings
        audio_generator = client.generate(
            text=text,
            voice=Voice(voice_id=voice_id, settings=settings),
            model="eleven_multilingual_v2"
        )

        # Initialize a bytes buffer
        buffer = io.BytesIO()
        # Collect all data from the generator
        for chunk in audio_generator:
            buffer.write(chunk)
        # Move back to the beginning of the BytesIO buffer
        buffer.seek(0)
        # Load the audio segment from the buffer
        audio_segment = AudioSegment.from_file(buffer, format="mp3")
        combined_audio += audio_segment  # Combine sequentially as they appear
        print(f"Audio segment added for {speaker}.")

        # Add a random delay between 0.2 to 0.5 seconds
        delay_duration = random.uniform(0.2, 0.5) * 1000  # Convert to milliseconds
        silent_segment = AudioSegment.silent(duration=delay_duration)
        combined_audio += silent_segment

    # Generate filename with timestamp and speaker/guest names
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{timestamp}_{speaker_name}_conversation_with_{guest_name}.mp3"
    
    # Ensure the output folders exist
    podcast_out_folder = os.path.join("generated_podcast", "podcast_out")
    os.makedirs(podcast_out_folder, exist_ok=True)
    
    # Output the combined audio to the file
    audio_file_path = os.path.join(podcast_out_folder, filename)
    combined_audio.export(audio_file_path, format="mp3")
    print(f"Audio file saved to {audio_file_path}")

    return filename, audio_file_path

Elevate Your Podcasting Experience with AI-Powered Solutions

If you're ready to harness the innovative power of AI for your podcast production, our team at Mercity.ai is here to help. From crafting engaging, AI-generated dialogues to simulating conversations with historical figures or fictional characters, we offer custom AI podcasting solutions tailored to your creative needs. Whether you aim to explore untouched topics, bring to life the voices of the past, or captivate your audience with unique, personalized content, Mercity.ai has the expertise to transform your podcasting aspirations into reality. Don't miss out on the opportunity to revolutionize your podcast production. Contact Mercity.ai today to discover how our advanced AI technologies can enhance your storytelling and engage your audience like never before. Reach out now!
