I listen to the Legends of the Old West podcast, a western-themed episodic show centered around outlaws. The narrator is great and the character actions are descriptive, but I'm left wanting more.
With the surge of Stable Diffusion projects, I was inspired to build something around AI-generated art.
What I ended up with: transcribe the podcast audio into text, then generate images based on that text. Take a look below for an example.
The bulk of the work is done by Vosk, an offline, open source speech recognition toolkit. We convert the input MP3 to WAV, run it through Vosk, and receive a generated JSON output file.
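The MP3-to-WAV step can be scripted as well. Here is a minimal sketch that shells out to ffmpeg (ffmpeg itself, the file names, and the mono 16 kHz settings are my assumptions, not from the original post; Vosk's English models expect mono PCM audio):

```python
import os
import shutil
import subprocess

def ffmpeg_cmd(mp3_path, wav_path):
    """Build an ffmpeg command converting an MP3 to mono 16 kHz WAV."""
    return [
        "ffmpeg", "-y",      # -y: overwrite the output file if it exists
        "-i", mp3_path,      # input MP3
        "-ac", "1",          # downmix to a single channel
        "-ar", "16000",      # resample to 16 kHz
        wav_path,
    ]

cmd = ffmpeg_cmd("tk150.mp3", "tk150-split.wav")
# Only run the conversion if ffmpeg and the input file are actually present.
if shutil.which("ffmpeg") and os.path.exists("tk150.mp3"):
    subprocess.run(cmd, check=True)
```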
from vosk import Model, KaldiRecognizer
import wave
import json

wav_file = "tk150-split.wav"
model_path = "vosk-model-en-us-0.22"

model = Model(model_path)
wf = wave.open(wav_file, "rb")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)

text_lst = []

# Feed the audio to the recognizer in 4000-frame chunks
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # A complete utterance has been recognized
        sentence = json.loads(rec.Result())["text"]
        if len(sentence) > 0:
            text_lst.append(sentence)
            print("....new sentence...")
    else:
        print("....processing.......")

if text_lst:
    with open("output-audio-file.txt", "w") as filehandle:
        json.dump(text_lst, filehandle)
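For reference, each `rec.Result()` payload is a small JSON document; with `SetWords(True)` it carries per-word timings and confidences alongside the full `"text"` field the script keeps. A sketch of the shape (the words, timings, and confidences here are illustrative, not real output):

```python
import json

# Illustrative shape of one rec.Result() payload with SetWords(True)
sample = json.dumps({
    "result": [
        {"word": "hello", "start": 0.5, "end": 0.9, "conf": 1.0},
        {"word": "there", "start": 0.9, "end": 1.3, "conf": 0.98},
    ],
    "text": "hello there",
})

payload = json.loads(sample)
print(payload["text"])                         # the full utterance
print([w["word"] for w in payload["result"]])  # per-word detail
```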
With our transcribed text in hand, we can run it through whatever AI image service we like. For speed, I used Lexica.art, which searches a library of already-generated Stable Diffusion images by prompt rather than generating new ones.

import requests
import json
import time

# Get Lexica AI-generated art by prompt
def image(prompt):
    """Return the first Lexica search result for a prompt, or an empty dict."""
    results = requests.get("https://lexica.art/api/v1/search", params={"q": prompt})
    if results.status_code != 200:
        print("Requested URL: %s" % results.url)
        print("Content: %s" % results.content)
        results.raise_for_status()
    results = results.json()
    if results and results["images"]:
        response = {
            "src": results["images"][0]["src"],
            "alt": results["images"][0]["prompt"],
        }
        return response
    return {}

# Open the transcription file for reading
with open("output-audio-file.txt", "r") as filehandle:
    output_audio_text = json.load(filehandle)

# Look up an image for each line of transcribed audio
images_list = []
for line in output_audio_text:
    images_list.append(image(line))
    time.sleep(1)  # rate-limit our requests

# Write the image list to file
if images_list:
    with open("images-generated.json", "w") as f:
        json.dump(images_list, f, ensure_ascii=False)
For the demo, I used Flask to create a small web app that serves the images and text.
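A minimal sketch of such a Flask app (the route, inline template, and reuse of `images-generated.json` are my assumptions based on the snippets above, not the actual repo code):

```python
import json
from flask import Flask, render_template_string

app = Flask(__name__)

# Tiny inline template: one <figure> per image, captioned with its prompt
PAGE = """
<!doctype html>
<title>podcast-image</title>
{% for img in images %}
<figure>
  <img src="{{ img['src'] }}" alt="{{ img['alt'] }}">
  <figcaption>{{ img['alt'] }}</figcaption>
</figure>
{% endfor %}
"""

@app.route("/")
def index():
    # Load the results written by the Lexica step; skip empty lookups
    with open("images-generated.json") as f:
        images = [img for img in json.load(f) if img]
    return render_template_string(PAGE, images=images)

if __name__ == "__main__":
    app.run(debug=True)
```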
Code on Github: podcast-image