iOS 10 was released this week, and there are so many new APIs to take advantage of, such as SFSpeechRecognizer, SiriKit, CallKit, and many more! In this tutorial, you'll learn how to use the SFSpeechRecognizer API together with the AVSpeechSynthesizer API to create an app that you can talk to. While I was trying out the SFSpeechRecognizer API, I had trouble getting the SFSpeechRecognizer and the AVSpeechSynthesizer to work together. Here's how I did it:

So, let's get started! You can download the starter project here. First, unzip the project and open it in Xcode.

Head on over to ViewController.swift and add the following imports:

import Speech
import AVFoundation

Then, add these variables:

@IBOutlet var label: UILabel!

var greetings = ["hi", "hello", "howdy", "what's up", "hey", "good morning", "good afternoon", "good evening"]
var salutations = ["bye", "see you later", "bye for now"]

let audioEngine = AVAudioEngine()
let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
var recognitionTask: SFSpeechRecognitionTask?
var speechSynthesizer = AVSpeechSynthesizer()

These lines of code declare our keyword arrays (the phrases the speech recognizer will match against) and instantiate the AVAudioEngine, the SFSpeechRecognizer, the speech recognition request, the recognition task, and the speech synthesizer.
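One thing worth noting about these declarations: SFSpeechRecognizer(locale:) is a failable initializer (it returns nil for unsupported locales), which is why it is force-unwrapped above. The recognizer can also tell you whether recognition is currently possible at all; a check like the one below is optional and shown only as an illustration, not part of the starter project:

// isAvailable is false when recognition can't run right now
// (for example, if there is no network connection).
if !speechRecognizer.isAvailable {
     print("Speech recognition is not available right now.")
}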

Then, fill in the viewDidLoad() function and add the speech synthesizer's didFinish delegate method right below it:

override func viewDidLoad() {
     super.viewDidLoad()

     speechRecognizer.delegate = self
     speechSynthesizer.delegate = self

     // Ask for permission to use Speech Recognition.
     SFSpeechRecognizer.requestAuthorization { authStatus in
         OperationQueue.main.addOperation {
             switch authStatus {
                 case .authorized:
                     break
                 default:
                     print(authStatus)
             }
         }
     }
     // Start listening as soon as the view loads.
     try? startRecording()
}

// AVSpeechSynthesizerDelegate method: once the synthesizer finishes speaking,
// start listening again.
func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
     try! self.startRecording()
}

Here, we set the view controller as the delegate of both the speech recognizer and the speech synthesizer, ask for permission to use Speech Recognition, and then start recording. We also implement the AVSpeechSynthesizerDelegate method speechSynthesizer(_:didFinish:) so that the app starts listening again as soon as it finishes speaking. Note that the view controller has to adopt both delegate protocols for those assignments to compile; see the sketch below.
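Assuming the starter project names the class ViewController (adjust this to match your own project), the class declaration with both conformances would look something like this:

class ViewController: UIViewController, SFSpeechRecognizerDelegate, AVSpeechSynthesizerDelegate {
     // outlets, properties, and the methods from this tutorial live here
}

Since viewDidLoad() calls the startRecording() function, let's implement that next: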

func startRecording() throws {
     // Tap the microphone input and stream its buffers into the recognition request.
     let node = audioEngine.inputNode!
     let recordingFormat = node.outputFormat(forBus: 0)
     node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
         self.request.append(buffer)
     }
     audioEngine.prepare()
     try audioEngine.start()
     // Kick off the recognition task and react to each transcription that comes back.
     recognitionTask = speechRecognizer.recognitionTask(with: request) { result, error in
         if let result = result {
             print(result.bestTranscription.formattedString.lowercased())
             if self.greetings.contains(result.bestTranscription.formattedString.lowercased()) {
                 // Heard a greeting: stop listening and reply with a random greeting.
                 node.removeTap(onBus: 0)
                 self.cancelRecording()
                 self.speak(self.greetings[Int(arc4random_uniform(UInt32(self.greetings.count)))])
             } else if self.salutations.contains(result.bestTranscription.formattedString.lowercased()) {
                 // Heard a salutation: stop listening and reply with a random salutation.
                 node.removeTap(onBus: 0)
                 self.cancelRecording()
                 self.speak(self.salutations[Int(arc4random_uniform(UInt32(self.salutations.count)))])
             } else {
                 // Anything else: stop listening and say we didn't understand.
                 node.removeTap(onBus: 0)
                 self.cancelRecording()
                 self.speak("Sorry, I didn't recognize that.")
             }
             self.label.text = result.bestTranscription.formattedString.lowercased()
         }
     }
}

In this function, we install a tap on the audio engine's input node (the microphone) and append each incoming audio buffer to our request, which streams it to Apple's servers. We then receive transcriptions of the audio we are continuously sending, which is what gives us real-time results. Once we have a result, we check whether it appears in the greetings or in the salutations. If the result is in the greetings, we use the speech synthesizer to speak a random greeting; if it is in the salutations, we speak a random salutation. If neither array contains the result, we say that we did not recognize what was spoken.
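The live, word-by-word transcriptions come from the request's partial-results behavior, which is enabled by default on SFSpeechAudioBufferRecognitionRequest. Just for reference (the tutorial relies on the default, so you don't need to add this line), this is the flag that controls it:

// Enabled by default; set this to false if you only want a single final
// transcription instead of a running stream of partial results.
request.shouldReportPartialResults = true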

The problem I encountered while using SFSpeechRecognizer and AVSpeechSynthesizer simultaneously was that the speech synthesizer would not speak while the speech recognizer had a recognition task running or while the AVAudioEngine was running. Because of this, I created a cancelRecording() function, which is as follows:

func cancelRecording() {
     audioEngine.stop()
     request.endAudio()
     recognitionTask?.cancel()
}

This function stops the audio engine, ends the request's audio, and cancels the recognition task. The final function left to implement is the speak() function. Here it is:

func speak(_ speechString: String) {
     let speechUtterance = AVSpeechUtterance(string: speechString)
     speechUtterance.volume = 1.0
     speechUtterance.rate = 0.5
     speechUtterance.pitchMultiplier = 1.15
     speechSynthesizer.speak(speechUtterance)
}

This function simply takes in a string, builds an AVSpeechUtterance from it with the volume, rate, and pitch set, and hands it to the AVSpeechSynthesizer to speak. Nice and simple!
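If you want to tweak how the reply sounds even further, AVSpeechUtterance also lets you assign a specific voice. This is optional and not part of the tutorial project, but as an illustration you could add a line like this in speak() before calling speechSynthesizer.speak(speechUtterance):

// Optional: use an explicit voice; AVSpeechSynthesisVoice(language:) returns nil
// for unsupported language codes, in which case the default voice is used.
speechUtterance.voice = AVSpeechSynthesisVoice(language: "en-US")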

Finally, add a usage description for the NSSpeechRecognitionUsageDescription key in Info.plist. Since the app also records from the microphone, add an NSMicrophoneUsageDescription entry as well.

And, that’s it for this tutorial! You can download the final project here. Now, you can say any keyword from the greetings or the salutations array and the app will speak another random greeting or salutation back to you! Here’s a video of it in action:


I hope you enjoyed this tutorial, and if you have any questions or comments, feel free to join the discussion below! Come back for some more tutorials on iOS Development!