Audiobook narration trial with ElevenLabs

I’ve been exploring ElevenLabs TTS for synthesized audiobook narration. ElevenLabs recently announced a community initiative encouraging discussion about the project (the incentive being TTS character allocation). Since this post has been simmering in my head for a while anyway, I figure now’s as good a time as any.

So far, this is the best-quality voice option I’ve found, but it is not perfect, and a high-quality result requires time and effort. What would make the service really practical for this purpose is granular configurability at the sentence or even word level.

ElevenLabs has rolled out their “Projects” feature to narrate long-form text. But for me, this has not been practical. Pacing, intonation, emotion, etc. are not easily configurable in the service, and narrating large swaths of text at a time produces a subpar result. What I would love is the ability to highlight certain words within a project and configure them: add mood modifiers, for example. Or add pacing modifiers. I realize this can seem like a tall ask, but I really believe being able to issue granular direction before generation is the last big thing missing from TTS services. ElevenLabs has the voice quality down: now we just need to be able to control the thing.

What I do currently instead of using Projects is use the normal Speech Synthesis page to narrate 1-2 sentences at a time. I add modifiers or dialogue tags to the text to elicit a certain tone or emotion (these modifiers then get cut in my DAW).

Example:

"What do you think you're doing?"

Might turn into:

She laughed, ending on a high note: "What do you think you're doing?"

This is hit-and-miss, but does make a positive difference to my output.

In my audio editor, I then cut out the part of the output that says “She laughed, ending on a high note” to end up with the original dialogue only.
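If you wanted to prepare these prompts in bulk rather than typing them out by hand, the text side of the trick could be sketched roughly like this. It only builds the string to paste into Speech Synthesis; it does not call any ElevenLabs API, and the function name `add_direction` is just something I made up for illustration.

```python
def add_direction(direction: str, dialogue: str) -> str:
    """Prefix a line of dialogue with a throwaway stage direction.

    The direction is only there to nudge the delivery; it still has to be
    cut from the generated audio afterwards.
    """
    # Drop any trailing colon or space so we never end up with "note:: ".
    cleaned = direction.rstrip(": ")
    return f'{cleaned}: "{dialogue}"'


line = "What do you think you're doing?"
prompt = add_direction("She laughed, ending on a high note", line)
print(prompt)
```

Keeping the direction and the dialogue as separate values also makes it easy to see later exactly which part of the audio needs to be trimmed.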

The downside is that this wastes characters from your allocation, since you’re not just paying for the original dialogue but the modifier text as well. Additionally, sometimes I end up having to regenerate the same line 5-10 times to get the exact voice quality I want. It can definitely add up.
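To get a feel for how quickly it adds up, here is a rough back-of-the-envelope sketch. It assumes usage is counted per character of submitted text and that every regeneration is charged in full; both are my assumptions for the sake of the arithmetic, and `characters_used` is a made-up helper, not anything from ElevenLabs.

```python
def characters_used(direction: str, dialogue: str, attempts: int = 1) -> dict:
    """Estimate characters consumed by one line, with and without a direction.

    Assumes (hypothetically) that each attempt is charged for the full
    submitted text, counted per character.
    """
    prompt = f'{direction}: "{dialogue}"'
    return {
        "dialogue_only": len(dialogue) * attempts,
        "with_direction": len(prompt) * attempts,
        # Characters spent on text that gets cut from the final audio.
        "overhead": (len(prompt) - len(dialogue)) * attempts,
    }


usage = characters_used(
    "She laughed, ending on a high note",
    "What do you think you're doing?",
    attempts=5,
)
print(usage)
```

For a short line like this one, the direction is longer than the dialogue itself, so more than half of what gets submitted never makes it into the finished audiobook.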

TL;DR: ElevenLabs voice quality is so far unmatched for me. Granular pre-generation configurability is the last (big) piece of the puzzle currently missing.

Jan 23, 2024
