July 14, 2012

What it takes to speak kana

The latest version of KanaSwirl adds a pronunciation feature. The first obstacle I ran into in adding this feature was getting high-quality pronunciation audio snippets. After a long search, I ended up using a text-to-speech engine since that produced better results than anything else available to me right now. I'll continue to search, but what I've got now works. The rest of the audio obstacles were more technical.

Audio Types

Each iOS application that wants to play audio has to declare what kind of audio it produces. A game's explosion sound effects are different from a podcast app's playback, and both are different from an application that records audio from the device's microphone. Each of these kinds of audio has a specific type. Each type of audio has a different set of behaviors, which cover things like its effect on other apps' audio, whether the mute switch is respected or ignored, and so on.

It turns out that the correct type of audio for KanaSwirl varies. One of the settings for playing KanaSwirl is "Speak Always (Hiding Romaji)". Instead of the usual romaji in the middle of the swirl, you see a speaker icon. Each round, you hear a kana, then you tap the kana corresponding to the audio you heard. If you want to hear the pronunciation again you can just tap the speaker. It looks like this:

Now what if you have the mute switch on? If KanaSwirl respected the mute switch, then it would be impossible to play because you would not be able to hear the prompt. Every round would just be a guessing game. So KanaSwirl ignores the mute switch by declaring its audio type as media playback (much like a podcast app). This audio type is reserved for applications where hearing audio is vital to the application's operation, and in "Speak Always (Hiding Romaji)" mode, it most certainly is.

Furthermore, in other game modes like "Speak Manually", what should happen if your mute switch is engaged and you tap the speaker icon? If KanaSwirl respected the mute switch here and played no audio, that would be bad. Such behavior makes the app seem buggy because it is not actually responding to the user's commands. That is another reason to use the media playback audio type and ignore the mute switch.

UInt32 type = kAudioSessionCategory_MediaPlayback;
AudioSessionSetProperty(kAudioSessionProperty_AudioCategory, sizeof(type), &type);

Unfortunately, as @jjcappa reported to me a few days ago, if KanaSwirl is not in "Speak Always (Hiding Romaji)" mode, and is instead in, say, "Speak New Kana Only", the automatically spoken pronunciations are supplementary and they should respect the mute switch. I hadn't considered this, but I will fix this for 2.1. The fix will be tricky because the audio type will necessarily need to change dynamically when you tap the speaker icon versus when the game automatically speaks for you.
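One way to picture that dynamic choice is a small decision function that picks the audio type per utterance. This is only a sketch in plain C, not the actual 2.1 fix: the SpeakMode and AudioCategory enums and the category_for_utterance function are hypothetical names, and the two categories are stand-ins for the real kAudioSessionCategory_AmbientSound (which respects the mute switch) and kAudioSessionCategory_MediaPlayback (which ignores it).

```c
#include <stdbool.h>

/* Hypothetical names for KanaSwirl's speech settings. */
typedef enum {
    SPEAK_MANUALLY,
    SPEAK_NEW_KANA_ONLY,
    SPEAK_ALWAYS_HIDING_ROMAJI
} SpeakMode;

/* Stand-ins for the real session categories. On iOS these would map to
   kAudioSessionCategory_AmbientSound (respects the mute switch) and
   kAudioSessionCategory_MediaPlayback (ignores the mute switch). */
typedef enum {
    CATEGORY_AMBIENT,
    CATEGORY_MEDIA_PLAYBACK
} AudioCategory;

/* Pick the category for a single spoken kana. */
AudioCategory category_for_utterance(SpeakMode mode, bool user_tapped_speaker) {
    /* An explicit tap must always produce sound, whatever the mode. */
    if (user_tapped_speaker) {
        return CATEGORY_MEDIA_PLAYBACK;
    }
    /* With romaji hidden, the automatic prompt is the only cue. */
    if (mode == SPEAK_ALWAYS_HIDING_ROMAJI) {
        return CATEGORY_MEDIA_PLAYBACK;
    }
    /* Otherwise automatic speech is supplementary and may be muted. */
    return CATEGORY_AMBIENT;
}
```

The chosen value would then be handed to AudioSessionSetProperty with kAudioSessionProperty_AudioCategory before the sound plays.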

You might wonder why I don't just inspect whether the mute switch is on or off and play audio appropriately. Apple actively discourages this and instead wants you to declare your audio type appropriately. I agree with this stricture because an application that declares its audio type keeps functioning correctly even as hardware and operating systems change. If the mute switch were to disappear, only Apple's way would have a chance of coping correctly.

Mixing Audio

Since KanaSwirl does not yet have background music, I encourage people to listen to their own music as they play the game. But that doesn't work with the media playback audio type that KanaSwirl uses; media playback silences other applications' audio. So I have to explicitly tell iOS that KanaSwirl can play well with others:

UInt32 mix = 1;
AudioSessionSetProperty(kAudioSessionProperty_OverrideCategoryMixWithOthers, sizeof(mix), &mix);

Simple enough, but there's still a problem here. Your music is typically going to drown out the pronunciations from the game, and that would be just as bad as having no pronunciations at all. It would be jarring and downright horrible to pause your music for a moment each time KanaSwirl speaks a character. Maybe KanaSwirl would be better off if it did forbid other audio?

It turns out there's a middle ground here: a feature called ducking. Ducking is an audio technique for reducing the volume of one audio track (say, your background music) so that the listener can better hear another audio track (say, KanaSwirl pronouncing セ).
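To make the idea concrete, here is a toy illustration of ducking as plain arithmetic on audio samples. iOS does the real work internally, so nothing like this appears in the game; the mix_with_ducking function and its duck_gain parameter are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Mix speech over music, one sample at a time. The music is scaled by
   duck_gain (0.0 = silent, 1.0 = untouched) so the speech stands out. */
void mix_with_ducking(const float *music, const float *speech,
                      float *out, size_t n, float duck_gain) {
    assert(duck_gain >= 0.0f && duck_gain <= 1.0f);
    for (size_t i = 0; i < n; i++) {
        out[i] = music[i] * duck_gain + speech[i];
    }
}
```

With a duck_gain of 0.25, music at 0.8 drops to 0.2 while the speech passes through at full level.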

iOS has APIs for auto-ducking, and like everything else I'm describing in this post, it was tricky to get exactly right! First, I tried to duck only when the game speaks a kana. This has the same jarring effect I described above: your audio keeps cutting in and out. Furthermore, this attempt introduced an annoying lag every time the game ducked your audio, because it would have to gradually turn your background music down, play the kana pronunciation, then turn your background music back up. For some reason this hung the game. Delaying the pronunciation by a good second in a fast-paced game like KanaSwirl is clearly not OK.

So what I do instead is just leave autoducking enabled. But we really only want to autoduck if the speech is vital to the game. That's the case only in "Speak Always (Hiding Romaji)" mode. So the method I use to set up audio takes a parameter called ducking:

+ (void)setSessionActiveWithMixing:(BOOL)ducking {
    ...
    // Track whether ducking is currently requested.
    isDucking = ducking;

    if (ducking) {
        // Ask iOS to lower other apps' audio while this session is active.
        UInt32 duck = 1;
        AudioSessionSetProperty(kAudioSessionProperty_OtherMixableAudioShouldDuck, sizeof(duck), &duck);
    }
    ...
}

Then when we initialize audio, we simply pass in a boolean indicating whether the game is in "Speak Always (Hiding Romaji)" mode.
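As a rough sketch of that decision, the two session flags can be derived from the speech mode alone. The SpeakMode enum, the SessionFlags struct, and flags_for_mode are hypothetical names used only for illustration; the comments note which real session property each flag would feed.

```c
#include <stdbool.h>

/* Hypothetical names for KanaSwirl's speech settings. */
typedef enum {
    SPEAK_MANUALLY,
    SPEAK_NEW_KANA_ONLY,
    SPEAK_ALWAYS_HIDING_ROMAJI
} SpeakMode;

typedef struct {
    bool mix_with_others; /* kAudioSessionProperty_OverrideCategoryMixWithOthers */
    bool duck_others;     /* kAudioSessionProperty_OtherMixableAudioShouldDuck */
} SessionFlags;

/* Always allow the user's music; duck it only when speech is vital. */
SessionFlags flags_for_mode(SpeakMode mode) {
    SessionFlags flags;
    flags.mix_with_others = true;
    flags.duck_others = (mode == SPEAK_ALWAYS_HIDING_ROMAJI);
    return flags;
}
```

The duck_others value is what would be passed as the ducking parameter when audio is initialized.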

Unfortunately, that is not good enough. What if the user starts the game, plays a few rounds, then decides she doesn't like this mode, so she goes to the Settings app and switches the game to "Speak Manually" mode? Well, we're still ducking audio even though she may never hear a single pronunciation. That's no good!

What we can do is subscribe to iOS's "changed app settings" notification, NSUserDefaultsDidChangeNotification. Then when the user changes the Speak mode from "Always (Hiding Romaji)" to "Manually", iOS informs us so we can change the audio settings appropriately.

Turns out even this is still not good enough! Here's an unfortunate sequence of events that this implementation produces:

  1. The user starts the game in "Speak Always (Hiding Romaji)".
  2. KanaSwirl loads and configures its audio settings.
  3. iOS turns down the user's music based on the app's audio settings.
  4. The user plays a few rounds then decides to change the speech mode.
  5. The user taps the home button, sending the game into the background.
  6. iOS turns up the user's music.
  7. The user navigates to Settings → KanaSwirl → Speak and sets its value to "Manually".
  8. The user double taps home and resumes KanaSwirl.
  9. iOS turns down the user's music based on the app's audio settings.
  10. iOS invokes the NSUserDefaultsDidChangeNotification event for KanaSwirl.
  11. KanaSwirl reinitializes its audio settings without ducking because "Speak Manually" does not need it.
  12. iOS turns up the user's music.
  13. The user continues playing, now slightly disoriented.

The problem here is that in the interval between steps 8 and 12 (which lasts only a few hundred milliseconds), the user's audio went from loud to quiet to loud again. Users will notice that. It is unacceptable.

I've found that the best way to solve this problem is to manage ducking not when you see the NSUserDefaultsDidChangeNotification event, but immediately once your app goes into the background, as in step 5! iOS is already un-ducking the audio at this point, so this change has no immediate user-audible effect anyway.

Then, when your app comes back into the foreground, the audio does not duck momentarily. You can reinitialize your audio, ducking or no, during applicationDidBecomeActive. The new sequence of events is much more pleasant.

  1. The user starts the game in "Speak Always (Hiding Romaji)".
  2. KanaSwirl loads and configures its audio settings.
  3. iOS turns down the user's music based on the app's audio settings.
  4. The user plays a few rounds then decides to change the speech mode.
  5. The user taps the home button, sending the game into the background.
  6. KanaSwirl turns off ducking.
  7. iOS turns up the user's music.
  8. The user navigates to Settings → KanaSwirl → Speak and sets its value to "Manually".
  9. The user double taps home and resumes KanaSwirl.
  10. KanaSwirl reinitializes its audio settings without ducking because "Speak Manually" does not need it.
  11. iOS invokes the NSUserDefaultsDidChangeNotification event for KanaSwirl, which mostly ignores it.
  12. The user continues playing, unawares and inexplicably happy.
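
The difference between the two sequences can be checked with a tiny model that counts how often the music volume audibly changes after the user resumes the game. This is a simplified simulation in plain C, not real iOS code: the AudioModel struct and every function here are invented for the example, and the model assumes iOS ducks the music exactly when the app is active and has requested ducking.

```c
#include <stdbool.h>

typedef struct {
    bool duck_requested; /* what the app last asked the session for */
    bool app_active;     /* is the game in the foreground? */
    bool music_ducked;   /* what the user actually hears */
    int volume_changes;  /* audible volume changes so far */
} AudioModel;

/* Assumed iOS behavior: duck only while the app is active and asks for it. */
static void sync_music(AudioModel *m) {
    bool should_duck = m->app_active && m->duck_requested;
    if (should_duck != m->music_ducked) {
        m->music_ducked = should_duck;
        m->volume_changes++;
    }
}

void app_set_ducking(AudioModel *m, bool duck) {
    m->duck_requested = duck;
    sync_music(m);
}

void app_set_active(AudioModel *m, bool active) {
    m->app_active = active;
    sync_music(m);
}

/* Count the volume changes heard from the moment the user resumes. */
int changes_on_resume(bool clear_ducking_in_background) {
    /* Playing in "Speak Always (Hiding Romaji)": music is ducked. */
    AudioModel m = { true, true, true, 0 };

    app_set_active(&m, false); /* home button: music comes back up */
    if (clear_ducking_in_background) {
        app_set_ducking(&m, false); /* inaudible: nothing is ducked now */
    }

    /* The user switches to "Speak Manually" in Settings, then resumes. */
    m.volume_changes = 0;
    app_set_active(&m, true);
    app_set_ducking(&m, false); /* reconfigure for "Speak Manually" */
    return m.volume_changes;
}
```

In this model, changes_on_resume(false) reports two audible changes (down, then back up), while changes_on_resume(true) reports none.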

All this just to play sound effects correctly!