# Local Inference Roadmap - Whisper Local Transcription

**Goal**: Implement local Whisper transcription using CPU-only inference for offline usage and cost reduction.

**Status**: ✅ Implementation Complete - Documentation Pending
**Priority**: High 🔥
## Implementation Summary

**Approach**: Speaches (self-hosted, OpenAI-compatible server)

**Why Speaches**:

- ✅ OpenAI API-compatible (drop-in replacement, zero code changes)
- ✅ Docker-based deployment (simple setup)
- ✅ Dynamic model loading (on-demand)
- ✅ Production-ready and actively maintained
- ✅ CPU/GPU support

**Configuration-Based Routing**:
```jsonc
{
  "transcription": {
    "backend": "openai", // or "speaches"
    "openai": {
      "apiKey": "sk-...",
      "model": "whisper-1"
    },
    "speaches": {
      "url": "http://localhost:8000/v1",
      "apiKey": "none",
      "model": "Systran/faster-whisper-base"
    }
  }
}
```
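Because Speaches exposes the OpenAI API, routing reduces to choosing a base URL, key, and model. A minimal sketch of the idea (assuming the official `openai` npm package; the `TranscriptionConfig` type and `createClient()` helper are illustrative, not the project's actual code):

```typescript
import OpenAI from "openai";

type Backend = "openai" | "speaches";

interface TranscriptionConfig {
  backend: Backend;
  openai: { apiKey: string; model: string };
  speaches: { url: string; apiKey: string; model: string };
}

// Both backends speak the same API; only baseURL, apiKey, and model differ.
function createClient(config: TranscriptionConfig): { client: OpenAI; model: string } {
  if (config.backend === "speaches") {
    return {
      client: new OpenAI({
        baseURL: config.speaches.url, // e.g. "http://localhost:8000/v1"
        apiKey: config.speaches.apiKey, // "none" placeholder, per the config above
      }),
      model: config.speaches.model,
    };
  }
  return {
    client: new OpenAI({ apiKey: config.openai.apiKey }),
    model: config.openai.model,
  };
}
```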
## Deployment Modes

### 1. OpenAI Cloud (Default) ☁️

- ✅ Zero setup
- ✅ Proven reliability
- ❌ Requires internet
- ❌ API costs

### 2. Speaches Local 🏠

- ✅ 100% offline
- ✅ Zero API costs
- ✅ Complete privacy
- ❌ Requires Docker

### 3. Speaches Remote 🌐

- ✅ Dedicated resources
- ✅ Multi-user support
- ✅ GPU acceleration
- ❌ Requires server setup

## Docker Setup (Local)

`docker-compose.speaches.yml`:
```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cpu
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ./hf-cache:/home/ubuntu/.cache/huggingface/hub
    environment:
      # Keep model loaded in memory forever (zero-latency transcription)
      - STT_MODEL_TTL=-1
      # CPU inference configuration
      - WHISPER__INFERENCE_DEVICE=cpu
      - WHISPER__COMPUTE_TYPE=int8
      - WHISPER__CPU_THREADS=8
      - WHISPER__USE_BATCHED_MODE=true
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 4G
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
**Commands** (a typical workflow for the compose file above):
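```bash
# Start the Speaches server in the background
docker compose -f docker-compose.speaches.yml up -d

# Follow the logs and verify the server is healthy
docker compose -f docker-compose.speaches.yml logs -f speaches
curl --fail http://localhost:8000/health

# Stop the server
docker compose -f docker-compose.speaches.yml down
```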
## Model Selection

| Model | Size | Memory | Use Case | Speed |
|---|---|---|---|---|
| tiny | 75 MB | ~273 MB | Fast, lower accuracy | Very fast |
| base | 142 MB | ~388 MB | Good balance ⭐ | Fast |
| small | 466 MB | ~852 MB | Better accuracy | Medium |
| medium | 1.5 GB | ~2.1 GB | High accuracy | Slower |
| large-v3 | 2.9 GB | ~3.9 GB | Best accuracy | Slowest |

**Recommendation**: Use `base` for voice dictation (fast + good accuracy).
## Implementation Phases

### ✅ Phase 4.1: Research & Architecture (COMPLETED)

- Research Speaches capabilities
- Validate OpenAI compatibility
- Design configuration architecture

**Status**: Completed - Speaches validated as the best solution
### ✅ Phase 4.2: Configuration Support (COMPLETED)

**Goal**: Add configuration-based backend selection
**Implemented Config Schema**: identical to the JSON shown under *Configuration-Based Routing* above.
**Completed Tasks**:

- [x] Add `transcription.backend` field (`'openai' | 'speaches'`)
- [x] Add `transcription.speaches.url` and `transcription.speaches.model`
- [x] Add `transcription.openai.apiKey` and `transcription.openai.model`
- [x] Update `Config.getTranscriptionConfig()` to return all fields
- [x] Add URL validation for Speaches (`validateSpeachesUrl()`, sketched below)
- [x] Update `config.example.json` with new structure
- [x] Add tests for new config fields
- [x] Add `benchmarkMode` for side-by-side comparison

**Files Modified**:

- ✅ `src/config/config.ts` - Full implementation
- ✅ `config.example.json` - Updated with new schema
- ✅ `src/config/config.test.ts` - Tests added
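The URL validation can stay small; a sketch of what `validateSpeachesUrl()` might look like (illustrative only, using the WHATWG `URL` parser built into Node):

```typescript
// Accept only absolute http(s) URLs such as "http://localhost:8000/v1".
function validateSpeachesUrl(url: string): boolean {
  try {
    const parsed = new URL(url);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // not parseable as an absolute URL
  }
}
```

For example, `validateSpeachesUrl("http://localhost:8000/v1")` returns `true`, while `validateSpeachesUrl("localhost:8000")` returns `false`.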
### ✅ Phase 4.3: TranscriptionService Refactor (COMPLETED)

**Goal**: Lazy initialization + unified transcription method

**Implemented Architecture**:

- ✅ Lazy client initialization via `getClient(backend)`
- ✅ `initializeSpeaches()` with proper async/await and error handling
- ✅ `loadSpeachesModel()` for model preloading
- ✅ Single unified `transcribe()` method (no separate per-backend methods)
- ✅ `warmup()` method for startup model preloading
**Completed Tasks**:

- [x] Add `getClient(backend: 'openai' | 'speaches')` method
- [x] Implement lazy initialization (clients created on demand)
- [x] Add `initializeSpeaches()` with full error handling
- [x] Add `loadSpeachesModel()` with POST to `/v1/models/{model}`
- [x] Single `transcribe()` method supporting both backends
- [x] Add `warmup()` for preloading at startup
- [x] Proper error propagation and logging
- [x] Add tests for both backends
- [x] Add tests for lazy initialization
- [x] Add tests for error handling
**Implementation Details**:

```typescript
// Lazy initialization - clients created only when needed
private async getClient(backend: "openai" | "speaches"): Promise<...>

// Preload Speaches model at startup (called from VoiceTranscriberApp)
public async warmup(forceSpeaches = false): Promise<...>

// Single transcribe method - backend determined by config
public async transcribe(filePath: string): Promise<...>
```
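Filling in those signatures, a minimal sketch of the lazy pattern (reusing the illustrative `TranscriptionConfig` type from the routing sketch above, and again assuming the `openai` npm package; error handling and logging trimmed):

```typescript
import fs from "node:fs";
import OpenAI from "openai";

class TranscriptionService {
  // Clients are created on first use and cached per backend.
  private clients: Partial<Record<"openai" | "speaches", OpenAI>> = {};

  constructor(private config: TranscriptionConfig) {}

  private async getClient(backend: "openai" | "speaches"): Promise<OpenAI> {
    if (!this.clients[backend]) {
      this.clients[backend] =
        backend === "speaches"
          ? new OpenAI({
              baseURL: this.config.speaches.url,
              apiKey: this.config.speaches.apiKey,
            })
          : new OpenAI({ apiKey: this.config.openai.apiKey });
    }
    return this.clients[backend]!;
  }

  // Single entry point; the backend comes from configuration.
  public async transcribe(filePath: string): Promise<string> {
    const backend = this.config.backend;
    const client = await this.getClient(backend);
    const model =
      backend === "speaches" ? this.config.speaches.model : this.config.openai.model;
    const result = await client.audio.transcriptions.create({
      file: fs.createReadStream(filePath),
      model,
    });
    return result.text;
  }
}
```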
**Files Modified**:

- ✅ `src/services/transcription.ts` - Full refactor
- ✅ `src/services/transcription.test.ts` - Comprehensive tests
- ✅ `src/index.ts` - Calls `warmup()` on startup
### ✅ Phase 4.3b: Speaches Model Preloading (COMPLETED)

**Goal**: Keep the model loaded in memory for zero-latency transcription

**Implementation**:

1. **Application-level preloading**:
   - `TranscriptionService.warmup()` called at app startup
   - `loadSpeachesModel()` POSTs to `/v1/models/{model}` (sketched below)
   - Conditional preload (Speaches backend OR benchmark mode)
2. **Docker-level persistence**:
   - `STT_MODEL_TTL=-1` in environment variables (never unload)
   - Healthcheck validates server availability
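A sketch of the preload step using Node 18's built-in `fetch` (the `POST /v1/models/{model}` path comes from the roadmap above; the exact response handling in the real `loadSpeachesModel()` may differ, and `TranscriptionConfig` is the illustrative type from earlier sketches):

```typescript
// Ask the Speaches server to load the model now, so the first real
// transcription doesn't pay the model-loading cost.
async function loadSpeachesModel(baseUrl: string, model: string): Promise<void> {
  // baseUrl is expected to already end in /v1, e.g. "http://localhost:8000/v1"
  const response = await fetch(`${baseUrl}/models/${model}`, { method: "POST" });
  if (!response.ok) {
    throw new Error(`Failed to preload model "${model}": HTTP ${response.status}`);
  }
}

// Called at startup; preloads only when Speaches will actually be used.
async function warmup(config: TranscriptionConfig, benchmarkMode: boolean): Promise<void> {
  if (config.backend === "speaches" || benchmarkMode) {
    await loadSpeachesModel(config.speaches.url, config.speaches.model);
  }
}
```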
**Completed Tasks**:

- [x] Implement `loadSpeachesModel()` with POST to the model endpoint
- [x] Add `warmup()` method for startup preloading
- [x] Call `warmup()` from the main app when using Speaches
- [x] Add `STT_MODEL_TTL=-1` to the docker-compose environment
- [x] Docker healthcheck for server availability
- [x] Cache directory mounted (`./hf-cache`)
**Files Modified**:

- ✅ `src/services/transcription.ts` - `warmup()` and `loadSpeachesModel()`
- ✅ `src/index.ts` - Calls `warmup()` on startup
- ✅ `docker-compose.speaches.yml` - All environment variables included
### ✅ Phase 4.3c: Main Application Integration (COMPLETED)

**Goal**: Clean architecture with unified processing

**Implementation**:

- ✅ Single `processAudioFile()` for normal mode
- ✅ Separate `processBenchmark()` for comparison mode
- ✅ No legacy methods (clean architecture from the start)
- ✅ Benchmark mode creates two TranscriptionService instances
- ✅ Detailed comparison metrics (performance, similarity, differences)
**Completed Tasks**:

- [x] Implement `processAudioFile()` using unified transcription
- [x] Implement `processBenchmark()` for side-by-side comparison
- [x] Add similarity analysis (Levenshtein distance)
- [x] Add text difference detection
- [x] Choose the best result automatically (longest transcription)
- [x] Update tests for the new architecture
**Architecture**:

```typescript
// Normal mode: uses the configured backend
async processAudioFile(filePath: string): Promise<void>

// Benchmark mode: compares both backends side-by-side
async processBenchmark(filePath: string): Promise<void>
```
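A condensed sketch of the benchmark path (timing, similarity scoring, and the longest-wins selection rule come from the task list above; `TranscriptionService` is the Phase 4.3 sketch, and `levenshteinSimilarity` stands in for the utilities in `src/utils/text-similarity.ts`, sketched under *Text Similarity Analysis* below):

```typescript
async function processBenchmark(
  filePath: string,
  config: TranscriptionConfig,
): Promise<void> {
  // Benchmark mode creates one service per backend.
  const services = {
    openai: new TranscriptionService({ ...config, backend: "openai" }),
    speaches: new TranscriptionService({ ...config, backend: "speaches" }),
  };

  const results: Record<string, { text: string; ms: number }> = {};
  for (const [name, service] of Object.entries(services)) {
    const start = performance.now();
    const text = await service.transcribe(filePath);
    results[name] = { text, ms: performance.now() - start };
  }

  const similarity = levenshteinSimilarity(results.openai.text, results.speaches.text);
  console.log(`openai: ${results.openai.ms.toFixed(0)} ms, speaches: ${results.speaches.ms.toFixed(0)} ms`);
  console.log(`Similarity: ${(similarity * 100).toFixed(1)}%`);

  // Selection rule from the roadmap: prefer the longest transcription.
  const best =
    results.openai.text.length >= results.speaches.text.length
      ? results.openai
      : results.speaches;
  console.log(`Selected: ${best.text}`);
}
```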
**Files Modified**:

- ✅ `src/services/audio-processor.ts` - Both modes implemented
- ✅ `src/services/audio-processor.test.ts` - Tests for both modes
- ✅ `src/index.ts` - Conditional logic based on `benchmarkMode`
- ✅ `src/utils/text-similarity.ts` - Similarity utilities
### ⚠️ Phase 4.4: Documentation (PARTIAL)

**Goal**: Complete user-facing documentation

**Completed**:

- [x] `docker-compose.speaches.yml` with environment variables
- [x] `config.example.json` with the new schema
- [x] Code documentation (JSDoc comments)
- [x] This roadmap document

**Missing**:

- [ ] Create `docs/SPEACHES_SETUP.md` guide
- [ ] Update main `README.md` with a Speaches section
- [ ] Add a troubleshooting guide for Speaches
- [ ] Document benchmark mode usage
- [ ] Add performance comparison data

**Priority**: 🔥 High - Users need setup instructions
### ✅ Phase 4.5: Testing & Validation (COMPLETED)

**Completed**:

- [x] Unit tests for TranscriptionService
- [x] Unit tests for Config
- [x] Unit tests for AudioProcessor
- [x] Tests for both OpenAI and Speaches backends
- [x] Tests for lazy initialization
- [x] Tests for error handling
- [x] Tests for benchmark mode
- [x] Similarity calculation tests

**Test Coverage**:

- ✅ `src/config/config.test.ts` - Configuration validation
- ✅ `src/services/transcription.test.ts` - Backend switching
- ✅ `src/services/audio-processor.test.ts` - Processing workflows
- ✅ `src/utils/text-similarity.test.ts` - Comparison utilities

**Manual Testing Needed**:

- [ ] Real Speaches deployment test
- [ ] Model loading performance benchmarks
- [ ] Multi-language validation
- [ ] Long audio file tests
### 📋 Phase 4.6: Release (PLANNED)

**Goal**: Prepare for production release

**Tasks**:

- [ ] Complete documentation (Phase 4.4)
- [ ] Manual testing with a real Speaches instance
- [ ] Performance benchmarking documentation
- [ ] Update CHANGELOG.md
- [ ] Version bump to 0.3.0
- [ ] Git tag and release notes
- [ ] Update README badges/status

**Blocked by**: Phase 4.4 (Documentation)
## Implementation Timeline

**Total Duration**: ~4 days (3.5 days completed)

| Phase | Estimated | Actual | Status |
|---|---|---|---|
| 4.1 Research | 1 day | 1 day | ✅ Done |
| 4.2 Config | 0.5 days | 0.5 days | ✅ Done |
| 4.3 Service | 0.5 days | 1 day | ✅ Done (more comprehensive) |
| 4.3b Preload | - | 0.5 days | ✅ Done (added scope) |
| 4.3c Integration | - | 0.5 days | ✅ Done (added scope) |
| 4.4 Docs | 1 day | - | ⚠️ Partial |
| 4.5 Testing | 1 day | 0.5 days | ✅ Done |
| 4.6 Release | 0.5 days | - | 📋 Planned |

**Progress**: 85% complete (implementation done, documentation pending)
## Additional Features Implemented

### 🔬 Benchmark Mode

- Compare OpenAI and Speaches side-by-side
- Performance metrics (speed, duration)
- Text similarity analysis (Levenshtein distance)
- Word-level difference detection
- Automatic best-result selection
- Enabled via `"benchmarkMode": true` in config
### 🎯 Text Similarity Analysis

- Levenshtein distance calculation
- Character-level and word-level comparison
- Difference highlighting
- Similarity percentage scoring
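For reference, the classic dynamic-programming form of the distance, with similarity normalized by the longer string (a sketch; the actual implementation in `src/utils/text-similarity.ts` may differ in details):

```typescript
// Classic O(n*m) edit-distance, computed with a single rolling row.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prevDiag = dp[0]; // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const temp = dp[j]; // dp[i-1][j], needed as next diagonal
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prevDiag + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prevDiag = temp;
    }
  }
  return dp[b.length];
}

// 1.0 = identical, 0.0 = completely different.
function levenshteinSimilarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```

For example, `levenshteinSimilarity("kitten", "sitting")` is 1 - 3/7 ≈ 0.571, since the edit distance between the two words is 3.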
### ⚡ Zero-Latency Transcription

- Model preloading at startup
- Persistent model in Docker (`STT_MODEL_TTL=-1`)
- Lazy client initialization
- Minimal overhead for the second and all subsequent transcriptions
## Known Issues & Limitations

### Current Limitations

- **Documentation**: Setup guides incomplete (Phase 4.4)
- **Manual Testing**: No real-world Speaches deployment tested yet
- **Performance Data**: No documented benchmarks yet

### Future Improvements

- GPU support documentation
- Multiple model support (runtime switching)
- Remote Speaches server examples
- Performance tuning guide
- Cost analysis (OpenAI vs. self-hosted)
## Next Steps

### Immediate (Phase 4.4 - Documentation)

- Create a comprehensive `docs/SPEACHES_SETUP.md` guide
- Update main `README.md` with a Speaches section
- Add a troubleshooting section
- Document benchmark mode

### Before Release (Phase 4.6)

- Manual test with a real Speaches deployment
- Run benchmark mode with sample audio
- Document performance results
- Update CHANGELOG
- Version bump and release
**Version**: v3.0 (updated with actual implementation status)
**Last Updated**: 2025-01-11
**Project Status**: 🟢 Implementation Complete - Documentation Pending