API Documentation

Complete reference for the TTSFM Text-to-Speech API. Free, simple, and powerful.

Overview

The TTSFM API provides a modern, OpenAI-compatible interface for text-to-speech generation. It supports multiple voices, audio formats, and includes advanced features like text length validation and intelligent auto-combine functionality.

Base URL: http://tts.isp.skin/api/

Key Features

  • 🎤 11 different voice options - Choose from alloy, echo, nova, and more
  • 🎵 Multiple audio formats - MP3, WAV, OPUS, AAC, FLAC, PCM support
  • 🤖 OpenAI compatibility - Drop-in replacement for OpenAI's TTS API
  • ✨ Auto-combine feature - Automatically handles long text (>4096 chars) by splitting and combining audio
  • 📊 Text length validation - Smart validation with configurable limits
  • 📈 Real-time monitoring - Status endpoints and health checks

New in v3.2.9: Docker images now bind to 0.0.0.0 automatically, fixing localhost-only launches and 502 errors. Docs were refreshed with Host override guidance for locked-down deployments.

Authentication

Currently, the API supports optional API key authentication. If configured, include your API key in the request headers.

Authorization: Bearer YOUR_API_KEY
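
If a key is configured, a client can attach the header like this (a minimal sketch; `YOUR_API_KEY` is a placeholder and the helper name is illustrative):

```python
def build_headers(api_key=None):
    """Return request headers, adding Authorization only when a key is set.

    Sketch only: the API key is optional, so the header is omitted
    entirely when no key is configured.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```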

Text Length Validation

TTSFM includes built-in text length validation to ensure compatibility with TTS models. The default maximum length is 4096 characters, but this can be customized.

Important: Text exceeding the maximum length will be rejected unless validation is disabled or the text is split into chunks.

Validation Options

  • max_length: Maximum allowed characters (default: 4096)
  • validate_length: Enable/disable validation (default: true)
  • preserve_words: Avoid splitting words when chunking (default: true)
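
A word-preserving splitter in the spirit of `preserve_words` might look like the following sketch (illustrative only; the server's actual chunking logic may differ):

```python
def split_text(text, max_length=4096, preserve_words=True):
    """Split text into chunks of at most max_length characters.

    With preserve_words=True, each chunk breaks at the last space before
    the limit so words are never cut in half. Illustrative sketch only.
    """
    chunks = []
    while len(text) > max_length:
        cut = max_length
        if preserve_words:
            space = text.rfind(" ", 0, max_length)
            if space > 0:
                cut = space
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```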

API Endpoints

GET /api/voices

Returns the list of available voices.

Response Example (truncated):
{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Alloy voice"
    },
    {
      "id": "echo",
      "name": "Echo", 
      "description": "Echo voice"
    }
  ],
  "count": 6
}
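
The voice list can be fetched and parsed along these lines (a sketch using only the standard library; `extract_voice_ids` and `fetch_voice_ids` are illustrative names, and the live call requires network access):

```python
import json
from urllib.request import urlopen

BASE_URL = "http://tts.isp.skin/api"

def extract_voice_ids(payload):
    """Pull the voice IDs out of a /api/voices response body."""
    return [voice["id"] for voice in payload.get("voices", [])]

def fetch_voice_ids():
    """Fetch the live voice list (requires network access)."""
    with urlopen(f"{BASE_URL}/voices", timeout=10) as resp:
        return extract_voice_ids(json.load(resp))
```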

GET /api/formats

Get available audio formats for speech generation.

Available Formats

The API accepts several format values in requests, but internally produces only two:

  • mp3 - Returns actual MP3 format
  • All other formats (opus, aac, flac, wav, pcm) - Mapped to WAV format
Note: When you request opus, aac, flac, wav, or pcm, you'll receive WAV audio data.
Response Example:
{
  "formats": [
    {
      "id": "mp3",
      "name": "MP3",
      "mime_type": "audio/mp3",
      "description": "MP3 audio format"
    },
    {
      "id": "opus", 
      "name": "Opus",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "aac",
      "name": "AAC", 
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    },
    {
      "id": "flac",
      "name": "FLAC",
      "mime_type": "audio/wav", 
      "description": "Returns WAV format"
    },
    {
      "id": "wav",
      "name": "WAV",
      "mime_type": "audio/wav",
      "description": "WAV audio format"
    },
    {
      "id": "pcm",
      "name": "PCM",
      "mime_type": "audio/wav",
      "description": "Returns WAV format"
    }
  ],
  "count": 6
}
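
The mp3-versus-WAV mapping described above can be expressed as a small helper (a sketch mirroring the documented behavior; the function name is illustrative):

```python
# Only "mp3" is returned as MP3; every other accepted format is
# delivered as WAV, per the format table above.
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def actual_mime_type(requested_format):
    """Return the MIME type the API will actually deliver."""
    fmt = requested_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {requested_format}")
    return "audio/mp3" if fmt == "mp3" else "audio/wav"
```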

POST /api/validate-text

Validate text length and get splitting suggestions.

Request Body:
{
  "text": "Your text to validate",
  "max_length": 4096
}
Response Example:
{
  "text_length": 5000,
  "max_length": 4096,
  "is_valid": false,
  "needs_splitting": true,
  "suggested_chunks": 2,
  "chunk_preview": [
    "First chunk preview...",
    "Second chunk preview..."
  ]
}
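
The fields in that response follow directly from the text length. A sketch of the arithmetic (assuming `suggested_chunks` is a simple ceiling division, which matches the example above but may differ from the server's exact logic):

```python
import math

def validate_text(text, max_length=4096):
    """Mirror the /api/validate-text response fields (sketch only)."""
    length = len(text)
    return {
        "text_length": length,
        "max_length": max_length,
        "is_valid": length <= max_length,
        "needs_splitting": length > max_length,
        # Assumption: chunk count is ceiling division of length by limit.
        "suggested_chunks": max(1, math.ceil(length / max_length)),
    }
```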

POST /api/generate

Generate speech from text.

Request Body:
{
  "text": "Hello, world!",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Speak cheerfully",
  "max_length": 4096,
  "validate_length": true
}
Parameters:
  • text (required): Text to convert to speech
  • voice (optional): Voice ID (default: "alloy")
  • format (optional): Audio format (default: "mp3")
  • instructions (optional): Voice modulation instructions
  • max_length (optional): Maximum text length (default: 4096)
  • validate_length (optional): Enable validation (default: true)
Response:

Returns the generated audio with an appropriate Content-Type header.
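
A call to this endpoint might look like the following (a sketch using only the standard library; the helper names and output filename are illustrative, and the live call requires network access):

```python
import json
from urllib.request import Request, urlopen

def build_generate_payload(text, voice="alloy", fmt="mp3", **extra):
    """Assemble the JSON body for POST /api/generate."""
    payload = {"text": text, "voice": voice, "format": fmt}
    payload.update(extra)  # e.g. instructions, max_length, validate_length
    return payload

def generate_speech(text, out_path="speech.mp3", **options):
    """Call the endpoint and save the returned audio (requires network)."""
    body = json.dumps(build_generate_payload(text, **options)).encode()
    req = Request(
        "http://tts.isp.skin/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=30) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```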

Python Package

Long Text Support

The TTSFM Python package includes built-in long text splitting functionality for developers who need fine-grained control:

from ttsfm import TTSClient, Voice, AudioFormat

# Create client
client = TTSClient()

# Generate speech from long text (automatically splits into separate files)
responses = client.generate_speech_long_text(
    text="Very long text that exceeds 4096 characters...",
    voice=Voice.ALLOY,
    response_format=AudioFormat.MP3,
    max_length=2000,
    preserve_words=True
)

# Save each chunk as separate files
for i, response in enumerate(responses, 1):
    response.save_to_file(f"part_{i:03d}.mp3")
Developer Features:
  • Manual Splitting: Full control over text chunking for advanced use cases
  • Word Preservation: Maintains word boundaries for natural speech
  • Separate Files: Each chunk saved as individual audio file
  • CLI Support: Use `--split-long-text` flag for command-line usage
Note: For web users, the auto-combine feature in `/v1/audio/speech` is recommended as it automatically handles long text and returns a single seamless audio file.

POST /api/generate-combined

Generate a single combined audio file from long text. Automatically splits text into chunks, generates speech for each chunk, and combines them into one seamless audio file.

Request Body:
{
  "text": "Very long text that exceeds the limit...",
  "voice": "alloy",
  "format": "mp3",
  "instructions": "Optional voice instructions",
  "max_length": 4096,
  "preserve_words": true
}
Response:

Returns a single audio file containing all chunks combined seamlessly.

Response Headers:
  • X-Chunks-Combined: Number of chunks that were combined
  • X-Original-Text-Length: Original text length in characters
  • X-Audio-Size: Final audio file size in bytes

POST /v1/audio/speech

Enhanced OpenAI-compatible endpoint with auto-combine feature. Automatically handles long text by splitting and combining audio chunks when needed.

Request Body:
{
  "model": "gpt-4o-mini-tts",
  "input": "Text of any length...",
  "voice": "alloy",
  "response_format": "mp3",
  "instructions": "Optional voice instructions",
  "speed": 1.0,
  "auto_combine": true,
  "max_length": 4096
}
Enhanced Parameters:
  • auto_combine (boolean, default: true):
    • true: Automatically split long text and combine audio chunks into a single file
    • false: Return error if text exceeds max_length (standard OpenAI behavior)
  • max_length (integer, default: 4096): Maximum characters per chunk when splitting
Response Headers:
  • X-Auto-Combine: Whether auto-combine was enabled (true/false)
  • X-Chunks-Combined: Number of audio chunks combined (1 for short text)
  • X-Original-Text-Length: Original text length (for long text processing)
  • X-Audio-Format: Audio format of the response
  • X-Audio-Size: Audio file size in bytes
Examples
# Short text (works normally)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello world!",
    "voice": "alloy"
  }'

# Long text with auto-combine (default)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": true
  }'

# Long text without auto-combine (will error)
curl -X POST http://tts.isp.skin/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Very long text...",
    "voice": "alloy",
    "auto_combine": false
  }'
Audio Combination: Uses advanced audio processing (PyDub) when available, with intelligent fallbacks for different environments. Supports all audio formats.
Use Cases:
  • Long Articles: Convert blog posts or articles to single audio files
  • Audiobooks: Generate chapters as single audio files
  • Podcasts: Create podcast episodes from scripts
  • Educational Content: Convert learning materials to audio
Example Usage:
# Python example
import requests

response = requests.post(
    "http://tts.isp.skin/api/generate-combined",
    json={
        "text": "Your very long text content here...",
        "voice": "nova",
        "format": "mp3",
        "max_length": 2000
    }
)

if response.status_code == 200:
    with open("combined_audio.mp3", "wb") as f:
        f.write(response.content)

    chunks = response.headers.get('X-Chunks-Combined')
    print(f"Combined {chunks} chunks into single file")

WebSocket Streaming

Real-time audio streaming for enhanced user experience. Get audio chunks as they're generated instead of waiting for the complete file.

WebSocket streaming provides lower perceived latency and real-time progress tracking for TTS generation.

Connection

// JavaScript WebSocket client
const client = new WebSocketTTSClient({
    socketUrl: 'http://tts.isp.skin',
    debug: true
});

// Connection events
client.onConnect = () => console.log('Connected');
client.onDisconnect = () => console.log('Disconnected');

Streaming TTS Generation

// Generate speech with real-time streaming
const result = await client.generateSpeech('Hello, WebSocket world!', {
    voice: 'alloy',
    format: 'mp3',
    chunkSize: 1024,  // Characters per chunk
    
    // Progress callback
    onProgress: (progress) => {
        console.log(`Progress: ${progress.progress}%`);
        console.log(`Chunks: ${progress.chunksCompleted}/${progress.totalChunks}`);
    },
    
    // Receive audio chunks in real-time
    onChunk: (chunk) => {
        console.log(`Received chunk ${chunk.chunkIndex + 1}`);
        // Process or play audio chunk immediately
        processAudioChunk(chunk.audioData);
    },
    
    // Completion callback
    onComplete: (result) => {
        console.log('Streaming complete!');
        // result.audioData contains the complete audio
    }
});

WebSocket Events

Client → Server Events

  • generate_stream - Start TTS generation. Payload: {text, voice, format, chunk_size}
  • cancel_stream - Cancel an active stream. Payload: {request_id}

Server → Client Events

  • stream_started - Stream initiated. Payload: {request_id, timestamp}
  • audio_chunk - Audio chunk ready. Payload: {request_id, chunk_index, audio_data, duration}
  • stream_progress - Progress update. Payload: {progress, chunks_completed, total_chunks}
  • stream_complete - Generation complete. Payload: {request_id, total_chunks, status}
  • stream_error - Error occurred. Payload: {request_id, error, timestamp}

Benefits

  • Real-time feedback: Users see progress as audio generates
  • Lower latency: First audio chunk arrives quickly
  • Cancellable: Stop generation mid-stream if needed
  • Efficient: Process chunks as they arrive

Example: Streaming Audio Player

// Create a streaming audio player
const audioChunks = [];
let isPlaying = false;

const streamingPlayer = await client.generateSpeech(longText, {
    voice: 'nova',
    format: 'mp3',
    
    onChunk: (chunk) => {
        // Store chunk
        audioChunks.push(chunk.audioData);
        
        // Start playing after first chunk
        if (!isPlaying && audioChunks.length >= 3) {
            startStreamingPlayback(audioChunks);
            isPlaying = true;
        }
    },
    
    onComplete: (result) => {
        // Ensure all chunks are played
        finishPlayback(result.audioData);
    }
});
Try It Out!

Experience WebSocket streaming in action at the WebSocket Demo or enable streaming mode in the Playground.