Voice control is becoming increasingly important for smart home systems. However, most available solutions come with significant privacy concerns and hardware requirements. In this post, we’ll explore how to build a private voice satellite for HomeAssistant using iOS devices, focusing on local processing to maximize privacy and security.
Current Challenges with Voice Control
When looking at existing HomeAssistant voice pipelines and satellite solutions, several issues become clear:
All processing typically happens server-side, from wake word detection to speech-to-text and text-to-speech conversion, requiring constant data transmission and raising privacy concerns
Significant server computing power and hardware resources are required for acceptable performance and response times, making self-hosting challenging
Server-based wake word detection requires continuous audio streaming from the satellite to the server, consuming significant bandwidth
Major privacy and security risks arise, especially when cloud services are involved, since sensitive audio data is transmitted and processed remotely
Limited control over data processing and retention policies when using third-party cloud services
Network latency can impact responsiveness and user experience
Potential single point of failure if the server becomes unavailable
A Privacy-Focused Approach
To address these challenges, I propose a solution based on the following key principles:
An iPad (Gen 6 or newer) serves as the voice satellite. Gen 6 does not have a Neural Engine, but it is still cheap and sufficient
Local wake word detection using offline models to identify activation phrases without cloud dependencies. Picovoice Porcupine is very well suited for this task
Local speech processing (both speech-to-text and text-to-speech) leveraging iOS’s built-in speech engines and neural processing capabilities
Minimal data transfer, only sending detected text commands to HomeAssistant over an encrypted connection, with no raw audio transmission
HomeAssistant uses an LLM-based conversation agent to handle the detected text command and generate responses. A small quantized LLM can run on the HomeAssistant server using Ollama or llama.cpp
This approach significantly reduces privacy concerns since audio never leaves the device. All voice processing happens directly on the iOS device hardware, eliminating the need for continuous audio streaming or cloud-based processing services. The system maintains full functionality even when internet connectivity is limited, with only the final processed text commands requiring network access.
Technical Implementation
On a technical level, we need the following components to perform the task.
I won't include the full source code here; instead, I want to focus on the technical implementation details and the things we have to take care of.
Wake word detection
Picovoice Porcupine can be added as a Swift package dependency, since a Swift iOS SDK is already provided. We just need an API key, which can be obtained from Picovoice. Please note that the free license only covers private, non-commercial projects, so we have to keep that in mind.
One important implementation detail is the audio format. We need to feed Porcupine audio in the format it was trained on: a 16 kHz, 16-bit, single-channel (mono) stream. This is critical, since wake word models are highly sensitive to format mismatches, which can severely degrade detection accuracy. We have to configure our iOS audio session accordingly with the correct format parameters. Porcupine only accepts 16-bit integer samples, which differs from the Float32 samples the iOS audio engine delivers by default, so we have to add some conversion logic using an AVAudioConverter. Porcupine also expects chunked input with a fixed frame length (512 samples), so we have to slice the captured audio into frames, for example using a ring buffer, and hand them to Porcupine one at a time. The chunking needs to be sample-accurate to avoid detection issues.
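The sketch below shows roughly how this conversion and chunking can be wired up, assuming the Porcupine Swift SDK is available and `accessKey` holds a valid Picovoice key; the class and callback names are illustrative, not part of any SDK.

```swift
import AVFoundation
import Porcupine

// Rough sketch: capture microphone audio, convert it to 16 kHz / 16-bit mono,
// and feed Porcupine exact frames of Porcupine.frameLength (512) samples.
final class WakeWordListener {
    private let engine = AVAudioEngine()
    private var porcupine: Porcupine?
    private var pending = [Int16]()   // accumulator for partial frames

    var onWakeWord: (() -> Void)?     // hand off to the speech-to-text phase

    func start(accessKey: String) throws {
        porcupine = try Porcupine(accessKey: accessKey, keyword: .porcupine)

        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .measurement, options: .defaultToSpeaker)
        try session.setActive(true)

        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)            // typically 48 kHz Float32
        let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                         sampleRate: Double(Porcupine.sampleRate),  // 16 kHz
                                         channels: 1,
                                         interleaved: true)!
        let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

        input.installTap(onBus: 0, bufferSize: 1024, format: inputFormat) { [weak self] buffer, _ in
            guard let self = self else { return }

            let ratio = targetFormat.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 16
            guard let converted = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                                   frameCapacity: capacity) else { return }
            var consumed = false
            var error: NSError?
            converter.convert(to: converted, error: &error) { _, status in
                if consumed { status.pointee = .noDataNow; return nil }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            guard error == nil, let channel = converted.int16ChannelData else { return }

            // Accumulate converted samples and process them in exact 512-sample frames.
            self.pending.append(contentsOf:
                UnsafeBufferPointer(start: channel[0], count: Int(converted.frameLength)))
            let frameLength = Int(Porcupine.frameLength)
            while self.pending.count >= frameLength {
                let frame = Array(self.pending.prefix(frameLength))
                self.pending.removeFirst(frameLength)
                if let index = try? self.porcupine?.process(pcm: frame), index >= 0 {
                    self.onWakeWord?()
                }
            }
        }
        try engine.start()
    }
}
```

In the real app, the same microphone tap can also forward buffers to the speech-to-text request once the wake word has fired, so only one audio capture pipeline is needed.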
For every processed frame, Porcupine reports whether a wake word was detected (the index of the detected keyword, or -1 if none). Detection sensitivity is configured when the engine is created (0.5 is the default). Once the wake word is detected, we continue with the speech-to-text phase.
Speech to text
The speech-to-text phase instantiates an SFSpeechAudioBufferRecognitionRequest. All captured audio is appended to the request, which transcribes it and reports the recognized text back. The speech recognition engine uses on-device models (accelerated by the Neural Engine where available) to perform real-time transcription with high accuracy; setting requiresOnDeviceRecognition on the request ensures the audio never leaves the device.
The important part here is that we enable shouldReportPartialResults for this request. This way we get a notification for every detected spoken word, even before the full transcription is complete, allowing for a more responsive user experience. The partial results also help detect speech boundaries more accurately. On every notification, we (re)start a voice inactivity timer, say for 2 seconds. If no further result arrives from the recognition request before the timer fires, we assume the speaker has finished their request, and we can continue with the next phase. Combining this timer with (optional) audio power level detection gives us a basic but effective voice activity detection system that reliably detects speech endpoints.
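A minimal sketch of this phase is shown below, assuming the audio buffers come from the same tap used for wake word detection; the `onFinalText` callback and the 2-second window are illustrative choices, not fixed API.

```swift
import AVFoundation
import Speech

// Rough sketch: on-device transcription with partial results, plus a simple
// inactivity timer that acts as the voice activity endpoint detector.
final class SpeechToText {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?
    private var silenceWorkItem: DispatchWorkItem?

    var onFinalText: ((String) -> Void)?   // hand the command to HomeAssistant here

    func begin() {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true     // notify on every recognized word
        request.requiresOnDeviceRecognition = true    // keep the audio on the device
        self.request = request

        task = recognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let self = self, let result = result else { return }
            let text = result.bestTranscription.formattedString

            // Restart the inactivity window on every (partial) result; if no new
            // result arrives within 2 seconds, assume the speaker has finished.
            self.silenceWorkItem?.cancel()
            let item = DispatchWorkItem { [weak self] in
                self?.request?.endAudio()
                self?.onFinalText?(text)
            }
            self.silenceWorkItem = item
            DispatchQueue.main.asyncAfter(deadline: .now() + 2.0, execute: item)
        }
    }

    // Called from the microphone tap while the speech-to-text phase is active.
    func append(_ buffer: AVAudioPCMBuffer) {
        request?.append(buffer)
    }
}
```

Speech recognition and microphone permissions (SFSpeechRecognizer.requestAuthorization and the usual AVAudioSession record permission) must of course be granted before this runs.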
HomeAssistant conversation agent integration
The HomeAssistant conversation agent integration is very straightforward. We send a request to the /api/conversation/process endpoint over REST, and wait for the textual response. The conversation agent handles all LLM interaction. The LLM configuration and everything else is done within HomeAssistant, so we can use Home-LLM or the Anthropic integration here. The agent processes the text using intent detection and entity extraction to understand the user’s request.
The interesting part is the conversation_id. We can reuse the same conversation ID across multiple agent requests and wake word activations. This gives the LLM an extended context, a kind of memory, to build its responses on, so we can use voice commands to feed the LLM knowledge it can draw on later. The conversation history allows for a more natural dialogue flow and contextual understanding. This knowledge is not limited to home automation requests; it can be used for almost anything, like family members' names, birthdays, habits, and so on. The context retention also enables follow-up questions and corrections. This hidden gem is really powerful! The textual response of the HomeAssistant API call is then passed to the text-to-speech stage.
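A sketch of this call might look as follows. The host and token are placeholders, error handling is omitted, and the JSON handling is an assumption based on the standard HomeAssistant conversation API response shape (response.speech.plain.speech plus a top-level conversation_id).

```swift
import Foundation

// Rough sketch of the /api/conversation/process call with conversation_id reuse.
final class ConversationClient {
    private let baseURL = URL(string: "https://homeassistant.local:8123")!  // placeholder
    private let token = "LONG_LIVED_ACCESS_TOKEN"                           // placeholder
    private var conversationId: String?                                     // reused across wake words

    func process(text: String) async throws -> String {
        var request = URLRequest(url: baseURL.appendingPathComponent("/api/conversation/process"))
        request.httpMethod = "POST"
        request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        var body: [String: Any] = ["text": text, "language": "en"]
        if let conversationId = conversationId {
            body["conversation_id"] = conversationId   // keeps the LLM context alive
        }
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        let (data, _) = try await URLSession.shared.data(for: request)
        let json = try JSONSerialization.jsonObject(with: data) as? [String: Any] ?? [:]

        // Remember the conversation_id so follow-up commands share the same context.
        conversationId = json["conversation_id"] as? String ?? conversationId

        let response = json["response"] as? [String: Any]
        let speech = (response?["speech"] as? [String: Any])?["plain"] as? [String: Any]
        return speech?["speech"] as? String ?? ""
    }
}
```

The returned string is what we pass on to the text-to-speech stage described next.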
Text to speech
Text to speech generation can be done using the AVSpeechSynthesizer API. The important thing here is to select the right voice to make the generated speech natural. iOS provides multiple voice options with different characteristics like gender, age, and accent. We can install additional high-quality neural voices using the iPad settings. These voices use advanced deep learning models to generate more natural prosody and intonation. The quality models need more RAM and processing power, but they are definitely worth it as they significantly improve the naturalness of speech. We can also fine-tune parameters like speaking rate, pitch and volume. Once the generated speech is finished playing through the audio system with proper audio session handling, we start over with the wake word detection stage while maintaining conversation context.
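A sketch of this stage is shown below, assuming an English voice; the enhanced-quality voice is only picked if the user has downloaded one in the iPad settings, and the `onFinished` callback is an illustrative hook for restarting wake word detection.

```swift
import AVFoundation

// Rough sketch: speak the agent's reply with an enhanced voice if available,
// then hand control back to wake word detection via the delegate callback.
final class Speaker: NSObject, AVSpeechSynthesizerDelegate {
    private let synthesizer = AVSpeechSynthesizer()
    var onFinished: (() -> Void)?     // restart wake word detection here

    override init() {
        super.init()
        synthesizer.delegate = self
    }

    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        // Prefer an enhanced (higher-quality) voice when one has been downloaded.
        let voices = AVSpeechSynthesisVoice.speechVoices().filter { $0.language.hasPrefix("en") }
        utterance.voice = voices.first { $0.quality == .enhanced } ?? voices.first
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate
        utterance.pitchMultiplier = 1.0
        utterance.volume = 1.0
        synthesizer.speak(utterance)
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        onFinished?()   // playback finished, loop back to wake word detection
    }
}
```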
This approach demonstrates that private, efficient voice control is achievable using existing and affordable hardware.
Real-World Performance
Testing this setup in practice revealed:
Porcupine provides reliable wake word detection with minimal false positives and negligible processing delays
Speech-to-text works well even on older iPads without a Neural Engine, though speaking a bit more slowly may be necessary
Timer-based voice activity detection is very effective at detecting speech boundaries
Text-to-speech produces natural-sounding output when using the enhanced neural voices, though some artificial qualities remain noticeable
Future Improvements
Several enhancements could further improve the system across multiple dimensions:
Security:
Replacing Porcupine with OpenWakeWord for a fully open-source solution with auditable code
Implementing input validation and sanitization for voice commands
Adding voice biometric authentication to prevent unauthorized access and allow personalized responses
Encrypting all cached voice data and command history
Usability:
Supporting offline operation mode with local command caching
Adding customizable wake word options and voice profiles
Improving accessibility with visual feedback and alternative input methods
Maintainability:
Moving to a modular architecture to easily swap components
Adding comprehensive logging and monitoring
Implementing automated testing for voice recognition accuracy
Creating tools for easy model updates and configuration management
