Skip to content

Voice Transcriber - Technical Architecture

Overview

Voice Transcriber is a lightweight desktop application that provides seamless voice-to-text conversion with system tray integration. The application follows a service-oriented architecture with clear separation of concerns and minimal dependencies.

Core Architecture Principles

1. Simplicity First

  • Each service has 3-5 core methods maximum
  • Simple interfaces: { success: boolean, error?: string }
  • No overengineering or complex retry logic
  • Console logging only (info/error levels)

2. Service-Oriented Design

  • Clear separation of concerns
  • Dependency injection for testability
  • Consistent error handling patterns
  • Graceful degradation when services fail

3. User-Centric Approach

  • First-run setup wizard
  • Visual feedback via system tray states
  • Automatic clipboard integration
  • Multilingual support (French/English auto-detection)

System Components

┌─────────────────────────────────────────────────────┐
│                Main Application                     │
│            (VoiceTranscriberApp)                    │
└─────────────────┬───────────────────────────────────┘
        ┌─────────┼─────────┐
        │         │         │
        ▼         ▼         ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
│   System    │ │    Audio    │ │   Configuration │
│    Tray     │ │  Recording  │ │     Service     │
│   Service   │ │   Service   │ │                 │
└─────────────┘ └─────────────┘ └─────────────────┘
        │                              │
        ▼                              ▼
┌─────────────┐                ┌─────────────────┐
│  Clipboard  │                │  OpenAI API     │
│   Service   │                │   Services      │
└─────────────┘                │                 │
                               │ ┌─────────────┐ │
                               │ │Transcription│ │
                               │ │   Service   │ │
                               │ └─────────────┘ │
                               │ ┌─────────────┐ │
                               │ │ Formatter   │ │
                               │ │   Service   │ │
                               │ └─────────────┘ │
                               └─────────────────┘

Service Descriptions

Main Application (src/index.ts)

Purpose: Central orchestrator that manages all services and application lifecycle.

Key Responsibilities: - Service initialization and dependency injection - Event handling and workflow coordination - Error management and recovery - Graceful shutdown handling

State Machine:

IDLE → RECORDING → PROCESSING → IDLE
  ↑                              ↓
  └──────────── ERROR ←──────────┘

System Tray Service (src/services/system-tray.ts)

Purpose: Manages system tray integration with visual state feedback.

Features: - Three visual states with distinct icons: - 🟢 IDLE: Green circle (ready to record) - 🔴 RECORDING: Red circle (actively recording) - 🟣 PROCESSING: Purple circle (transcribing audio) - Click-to-record functionality - Context menu with Start/Stop/Exit options - Cross-platform icon compatibility (Base64 encoded)

Icon Management: - Icons embedded as Base64 strings for npm distribution - Automatic menu state updates based on recording status - Workaround for node-systray-v2 double icon issues

Audio Recording Service (src/services/audio-recorder.ts)

Purpose: Handles system audio capture using Linux arecord.

Features: - Spawns arecord process for high-quality audio capture - Temporary file management in system temp directory - Process lifecycle management (start/stop/cleanup) - CD-quality WAV format (44.1kHz, 16-bit)

System Dependencies:

# Required for audio recording
sudo apt-get install alsa-utils

Transcription Service (src/services/transcription.ts)

Purpose: Converts audio files to text using OpenAI Whisper API.

Features: - Automatic language detection (French/English mixed speech) - Enhanced prompting for preserving original language structure - Technical term preservation in mixed-language contexts - Robust error handling for API failures

Configuration:

{
  apiKey: string;           // OpenAI API key
  language?: string;        // Auto-detect if undefined
  prompt?: string;          // Custom transcription prompt
}

Formatter Service (src/services/formatter.ts)

Purpose: Optional text enhancement using ChatGPT API.

Features: - Grammar and punctuation improvement - Language preservation (French/English/Spanish/German/Italian) - Configurable enable/disable - Temperature-controlled generation (0.3 for consistency)

Configuration Service (src/config/config.ts)

Purpose: Manages application configuration with user-friendly setup.

Features: - User config directory (~/.config/voice-transcriber/) - Interactive first-run setup wizard - API key validation and storage - JSON-based configuration file

Config Location:

~/.config/voice-transcriber/config.json

Clipboard Service (src/services/clipboard.ts)

Purpose: Cross-platform clipboard operations.

Features: - Automatic text copying after transcription - Cross-platform compatibility (Linux/Windows/macOS) - Simple success/error feedback

Data Flow

Recording Workflow

1. User clicks tray icon
2. System tray → RECORDING state
3. Audio recorder starts arecord process
4. User clicks again to stop
5. System tray → PROCESSING state
6. Audio file saved to temp directory (WAV format)
7. MP3 Encoder converts WAV to MP3 (~75% compression)
8. Transcription service → Whisper API (OpenAI or Speaches)
9. [Optional] Formatter service → ChatGPT API
10. Clipboard service writes final text
11. System tray → IDLE state

Error Handling Flow

Error occurs in any service
Service returns { success: false, error: string }
Main application logs error
System tray returns to IDLE state
User can retry operation

Technology Stack

Runtime & Build

  • Development: Bun ≥1.2.0 with TypeScript
  • Production: Node.js ≥22 (npm distribution)
  • Build: Bun bundler for single-file distribution
  • Package Management: Bun for development, npm for distribution

Core Dependencies

{
  "openai": "^5.11.0",              // OpenAI API integration
  "node-systray-v2": "...",         // System tray (improved fork)
  "clipboardy": "^4.0.0"            // Cross-platform clipboard
}

System Requirements

  • Linux: Ubuntu 22.04+ with alsa-utils and xsel
  • Audio: ALSA-compatible sound system
  • Desktop: System tray support (GNOME, KDE, XFCE)

Performance Characteristics

Memory Usage

  • Base: ~50MB (Node.js runtime + dependencies)
  • Recording: +10MB (audio buffer)
  • Processing: +20MB (API requests/responses)

API Usage

  • Whisper: ~$0.006 per minute of audio
  • GPT-3.5-turbo: ~$0.002 per transcription formatting
  • Rate Limits: Respects OpenAI API limits (no built-in retry)

File System

  • Temp Files: Created in /tmp/transcriber/
  • Config: Stored in ~/.config/voice-transcriber/
  • Cleanup: Automatic temp file cleanup on process exit

Security Considerations

API Key Management

  • Config file permissions: 600 (user read/write only)
  • API key validation on startup
  • No API key logging or exposure

Audio Privacy

  • Local audio processing only
  • Temporary files cleaned up automatically
  • No persistent audio storage

System Integration

  • Minimal system permissions required
  • No elevated privileges needed
  • Sandboxed execution environment

Testing Strategy

Test Coverage

  • Unit Tests: 37 tests across all services
  • Integration Tests: Full workflow validation
  • Mock Strategy: Simple mocks for external dependencies

Test Categories

// Service Tests
AudioRecorder.test.ts     // Recording lifecycle
TranscriptionService.test.ts  // API integration
SystemTrayService.test.ts // UI state management
Config.test.ts           // Configuration handling

// Integration Tests
index.test.ts            // Full application workflow

Testing Commands

make test              # Run all tests
make test-watch        # Watch mode for development
make test-file FILE=   # Run specific test file

Development Workflow

Setup

git clone <repository>
cd voice-transcriber
make install           # Install dependencies
make check-deps        # Verify system requirements
cp config.example.json config.json  # Setup config

Development Loop

make dev              # Start with auto-reload
make test             # Run tests
make format-check     # Lint and format

Build & Release

make build            # Build for production
make release-patch    # Create patch release
npm publish           # Publish to npm

Future Architecture Considerations

Scalability

  • Plugin system for additional AI providers
  • Configurable audio backends (PulseAudio, JACK)
  • Multi-language prompt templates

Platform Expansion

  • Windows support (replace arecord with Windows Audio API)
  • macOS support (replace arecord with Core Audio)
  • Web version using WebRTC

Performance Optimization

  • Local Whisper model integration (faster-whisper)
  • Audio compression before API upload
  • Streaming transcription for long recordings

Enhanced Features

  • Keyboard shortcut integration (when Wayland supports it)
  • Multiple output formats (Markdown, structured text)
  • Batch processing capabilities