API Documentation

Complete reference for the TTSFM Text-to-Speech API. Free, simple, and powerful.

Overview

The TTSFM API provides a modern, OpenAI-compatible interface for text-to-speech generation. It supports multiple voices, audio formats, and includes advanced features like text length validation and intelligent auto-combine functionality.

Base URL: http://tts.isp.skin/api/

Key Features

  • 🎤 11 different voice options - Choose from alloy, echo, nova, and more
  • 🎵 Multiple audio formats - MP3, WAV, OPUS, AAC, FLAC, PCM support
  • 🤖 OpenAI compatibility - Drop-in replacement for OpenAI's TTS API
  • ✨ Auto-combine feature - Automatically handles long text (>4096 chars) by splitting and combining audio
  • 📊 Text length validation - Smart validation with configurable limits
  • 📈 Real-time monitoring - Status endpoints and health checks

New in v3.2.9: Docker images now bind to 0.0.0.0 automatically, fixing localhost-only launches and 502 errors. Docs were refreshed with Host override guidance for locked-down deployments.

Authentication

Currently, the API supports optional API key authentication. If configured, include your API key in the request headers.

Authorization: Bearer YOUR_API_KEY
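
If a key is configured, a client can attach the header like this (a minimal sketch; `YOUR_API_KEY` is a placeholder and the helper name is illustrative):

```python
def build_headers(api_key=None):
    """Return request headers, adding Authorization only when a key is set.

    Sketch only: the API key is optional, so the header is omitted
    entirely when no key is configured.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```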

Text Length Validation

TTSFM includes built-in text length validation to ensure compatibility with TTS models. The default maximum length is 4096 characters, but this can be customized.

Important: Text exceeding the maximum length will be rejected unless validation is disabled or the text is split into chunks.

Validation Options

  • max_length: Maximum allowed characters (default: 4096)
  • validate_length: Enable/disable validation (default: true)
  • preserve_words: Avoid splitting words when chunking (default: true)
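
A word-preserving splitter in the spirit of `preserve_words` might look like the following sketch (illustrative only; the server's actual chunking logic may differ):

```python
def split_text(text, max_length=4096, preserve_words=True):
    """Split text into chunks of at most max_length characters.

    With preserve_words=True, each chunk breaks at the last space before
    the limit so words are never cut in half. Illustrative sketch only.
    """
    chunks = []
    while len(text) > max_length:
        cut = max_length
        if preserve_words:
            space = text.rfind(" ", 0, max_length)
            if space > 0:
                cut = space
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```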

API Endpoints

GET /api/voices

Returns the list of available voices.

Response Example (truncated):
{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Alloy voice"
    },
    {
      "id": "echo",
      "name": "Echo", 
      "description": "Echo voice"
    }
  ],
  "count": 6
}
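
The voice list can be fetched and parsed along these lines (a sketch using only the standard library; `extract_voice_ids` and `fetch_voice_ids` are illustrative names, and the live call requires network access):

```python
import json
from urllib.request import urlopen

BASE_URL = "http://tts.isp.skin/api"

def extract_voice_ids(payload):
    """Pull the voice IDs out of a /api/voices response body."""
    return [voice["id"] for voice in payload.get("voices", [])]

def fetch_voice_ids():
    """Fetch the live voice list (requires network access)."""
    with urlopen(f"{BASE_URL}/voices", timeout=10) as resp:
        return extract_voice_ids(json.load(resp))
```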

GET /api/formats

Get available audio formats for speech generation.

Available Formats

The API accepts several format values in requests, but internally produces only two:

  • mp3 - Returns actual MP3 format
  • All other formats (opus, aac, flac, wav, pcm) - Mapped to WAV format
Note: When you request opus, aac, flac, wav, or pcm, you'll receive WAV audio data.
Response Example:
{
  "formats": [
    {
      "id": "mp3",
      "name": "MP3",
      "mime_type": "audio/mp3",
      "description": "MP3 audio format"
    },
    {
      "id": "opus", 
      "name": "Opus",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "aac",
      "name": "AAC", 
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "flac",
      "name": "FLAC",
      "mime_type": "audio/wav", 
      "description": "Returns WAV format"
    },
    {
      "id": "wav",
      "name": "WAV",
      "mime_type": "audio/wav",
      "description": "WAV audio format"
    },
    {
      "id": "pcm",
      "name": "PCM",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    }
  ],
  "count": 6
}
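
The mp3-versus-WAV mapping described above can be expressed as a small helper (a sketch mirroring the documented behavior; the function name is illustrative):

```python
# Only "mp3" is returned as MP3; every other accepted format is
# delivered as WAV, per the format table above.
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def actual_mime_type(requested_format):
    """Return the MIME type the API will actually deliver."""
    fmt = requested_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {requested_format}")
    return "audio/mp3" if fmt == "mp3" else "audio/wav"
```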

POST /api/validate-text

Validate text length and get splitting suggestions.

Request Body:
{
  "text": "Your text to validate",
  "max_length": 4096
}
Response Example:
{
  "text_length": 5000,
  "max_length": 4096,
  "is_valid": false,
  "needs_splitting": true,
  "suggested_chunks": 2,
  "chunk_preview": [
    "First chunk preview...",
    "Second chunk preview..."
  ]
}
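
The fields in that response follow directly from the text length. A sketch of the arithmetic (assuming `suggested_chunks` is a simple ceiling division, which matches the example above but may differ from the server's exact logic):

```python
import math

def validate_text(text, max_length=4096):
    """Mirror the /api/validate-text response fields (sketch only)."""
    length = len(text)
    return {
        "text_length": length,
        "max_length": max_length,
        "is_valid": length <= max_length,
        "needs_splitting": length > max_length,
        # Assumption: chunk count is ceiling division of length by limit.
        "suggested_chunks": max(1, math.ceil(length / max_length)),
    }
```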

POST /api/generate

Generate speech from text.

Request Body:
{
  "text": "Hello, world!",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Speak cheerfully",
  "max_length": 4096,
  "validate_length": true
}
Parameters:
  • text (required): Text to convert to speech
  • voice (optional): Voice ID (default: "alloy")
  • format (optional): Audio format (default: "mp3")
  • instructions (optional): Voice modulation instructions
  • max_length (optional): Maximum text length (default: 4096)
  • validate_length (optional): Enable validation (default: true)
Response:

Returns the generated audio with an appropriate Content-Type header.
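
A call to this endpoint might look like the following (a sketch using only the standard library; the helper names and output filename are illustrative, and the live call requires network access):

```python
import json
from urllib.request import Request, urlopen

def build_generate_payload(text, voice="alloy", fmt="mp3", **extra):
    """Assemble the JSON body for POST /api/generate."""
    payload = {"text": text, "voice": voice, "format": fmt}
    payload.update(extra)  # e.g. instructions, max_length, validate_length
    return payload

def generate_speech(text, out_path="speech.mp3", **options):
    """Call the endpoint and save the returned audio (requires network)."""
    body = json.dumps(build_generate_payload(text, **options)).encode()
    req = Request(
        "http://tts.isp.skin/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=30) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```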

Python Package

Long Text Support

The TTSFM Python package includes built-in long text splitting functionality for developers who need fine-grained control:

from ttsfm import TTSClient, Voice, AudioFormat

# Create client
client = TTSClient()

# Generate speech from long text (automatically splits into separate files)
responses = client.generate_speech_long_text(
    text="Very long text that exceeds 4096 characters...",
    voice=Voice.ALLOY,
    response_format=AudioFormat.MP3,
    max_length=2000,
    preserve_words=True
)

# Save each chunk as separate files
for i, response in enumerate(responses, 1):
    response.save_to_file(f"part_{i:03d}.mp3")
Developer Features:
  • Manual Splitting: Full control over text chunking for advanced use cases
  • Word Preservation: Maintains word boundaries for natural speech
  • Separate Files: Each chunk saved as individual audio file
  • CLI Support: Use `--split-long-text` flag for command-line usage
Note: For web users, the auto-combine feature in `/v1/audio/speech` is recommended as it automatically handles long text and returns a single seamless audio file.

POST /api/generate-combined

Generate a single combined audio file from long text. Automatically splits text into chunks, generates speech for each chunk, and combines them into one seamless audio file.

Request Body:
{
  "text": "Very long text that exceeds the limit...",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Optional voice instructions",
  "max_length": 4096,
  "preserve_words": true
}
Response:

Returns a single audio file containing all chunks combined seamlessly.

Response Headers:
  • X-Chunks-Combined: Number of chunks that were combined
  • X-Original-Text-Length: Original text length in characters
  • X-Audio-Size: Final audio file size in bytes

POST /v1/audio/speech

Enhanced OpenAI-compatible endpoint with auto-combine feature. Automatically handles long text by splitting and combining audio chunks when needed.

Request Body:
{
  "model": "gpt-4o-mini-tts",
  "input": "Text of any length...",
  "voice": "alloy",
  "response_format": "mp3",
  "instructions": "Optional voice instructions",
  "speed": 1.0,
  "auto_combine": true,
  "max_length": 4096
}
Enhanced Parameters:
  • auto_combine (boolean, default: true):
    • true: Automatically split long text and combine audio chunks into a single file
    • false: Return error if text exceeds max_length (standard OpenAI behavior)
  • max_length (integer, default: 4096): Maximum characters per chunk when splitting
Response Headers:
  • X-Auto-Combine: Whether auto-combine was enabled (true/false)
  • X-Chunks-Combined: Number of audio chunks combined (1 for short text)
  • X-Original-Text-Length: Original text length (for long text processing)
  • X-Audio-Format: Audio format of the response
  • X-Audio-Size: Audio file size in bytes
Examples
# Short text (works normally)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello world!",
    "voice": "alloy"
  }'

# Long text with auto-combine (default)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": true
  }'

# Long text without auto-combine (will error)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": false
  }'
Audio Combination: Uses advanced audio processing (PyDub) when available, with intelligent fallbacks for different environments. Supports all audio formats.
Use Cases:
  • Long Articles: Convert blog posts or articles to single audio files
  • Audiobooks: Generate chapters as single audio files
  • Podcasts: Create podcast episodes from scripts
  • Educational Content: Convert learning materials to audio
Example Usage:
# Python example
import requests

response = requests.post(
    "http://tts.isp.skin/api/generate-combined",
    json={
        "text": "Your very long text content here...",
        "voice": "nova",
        "format": "mp3",
        "max_length": 2000
    }
)

if response.status_code == 200:
    with open("combined_audio.mp3", "wb") as f:
        f.write(response.content)

    chunks = response.headers.get('X-Chunks-Combined')
    print(f"Combined {chunks} chunks into single file")

WebSocket Streaming

Real-time audio streaming for enhanced user experience. Get audio chunks as they're generated instead of waiting for the complete file.

WebSocket streaming provides lower perceived latency and real-time progress tracking for TTS generation.

Connection

// JavaScript WebSocket client
const client = new WebSocketTTSClient({
    socketUrl: 'http://tts.isp.skin',
    debug: true
});

// Connection events
client.onConnect = () => console.log('Connected');
client.onDisconnect = () => console.log('Disconnected');

Streaming TTS Generation

// Generate speech with real-time streaming
const result = await client.generateSpeech('Hello, WebSocket world!', {
    voice: 'alloy',
    format: 'mp3',
    chunkSize: 1024,  // Characters per chunk
    
    // Progress callback
    onProgress: (progress) => {
        console.log(`Progress: ${progress.progress}%`);
        console.log(`Chunks: ${progress.chunksCompleted}/${progress.totalChunks}`);
    },
    
    // Receive audio chunks in real-time
    onChunk: (chunk) => {
        console.log(`Received chunk ${chunk.chunkIndex + 1}`);
        // Process or play audio chunk immediately
        processAudioChunk(chunk.audioData);
    },
    
    // Completion callback
    onComplete: (result) => {
        console.log('Streaming complete!');
        // result.audioData contains the complete audio
    }
});

WebSocket Events

Client → Server Events

  • generate_stream - Start TTS generation. Payload: {text, voice, format, chunk_size}
  • cancel_stream - Cancel an active stream. Payload: {request_id}

Server → Client Events

  • stream_started - Stream initiated. Payload: {request_id, timestamp}
  • audio_chunk - Audio chunk ready. Payload: {request_id, chunk_index, audio_data, duration}
  • stream_progress - Progress update. Payload: {progress, chunks_completed, total_chunks}
  • stream_complete - Generation complete. Payload: {request_id, total_chunks, status}
  • stream_error - Error occurred. Payload: {request_id, error, timestamp}

Benefits

  • Real-time feedback: Users see progress as audio generates
  • Lower latency: First audio chunk arrives quickly
  • Cancellable: Stop generation mid-stream if needed
  • Efficient: Process chunks as they arrive

Example: Streaming Audio Player

// Create a streaming audio player
const audioChunks = [];
let isPlaying = false;

const streamingPlayer = await client.generateSpeech(longText, {
    voice: 'nova',
    format: 'mp3',
    
    onChunk: (chunk) => {
        // Store chunk
        audioChunks.push(chunk.audioData);
        
        // Start playing after first chunk
        if (!isPlaying && audioChunks.length >= 3) {
            startStreamingPlayback(audioChunks);
            isPlaying = true;
        }
    },
    
    onComplete: (result) => {
        // Ensure all chunks are played
        finishPlayback(result.audioData);
    }
});
Try It Out!

Experience WebSocket streaming in action at the WebSocket Demo or enable streaming mode in the Playground.