Wednesday, February 21, 2007

Using the SpFileStreamClass Seek Method

After you have opened your file for reading
ISpeechFileStream sfs = new SpFileStreamClass();
sfs.Open(@"C:\dev\Speech Data\sapiWav\wavfile.wav", SpeechStreamFileMode.SSFMOpenForRead, false);

the Seek method can be used to
1. return the current audio stream position in bytes
// Seek returns a Variant, so cast it to decimal
decimal position = (decimal)sfs.Seek(0, SpeechStreamSeekPositionType.SSSPTRelativeToCurrentPosition);


2. move the current audio stream position backward or forward in bytes
// I have a 16-bit stereo wav file at 16000 Hz, so each second is
// 16000 samples * 2 bytes * 2 channels = 64000 bytes
const int stereoWavBytesPerSecond = 64000;
const int monoWavBytesPerSecond = 32000;

// move forward 4 seconds for a stereo file
sfs.Seek(4 * stereoWavBytesPerSecond, SpeechStreamSeekPositionType.SSSPTRelativeToCurrentPosition);


SpeechStreamSeekPositionType values include: SSSPTRelativeToEnd, SSSPTRelativeToStart, SSSPTRelativeToCurrentPosition
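The seconds-to-bytes arithmetic above can be wrapped in a small helper (a sketch; the helper name is mine, and the PCM parameters are assumptions matching the 16-bit, 16000 Hz file above):

```csharp
using System;

class SeekMath
{
    // Computes a byte offset for ISpeechFileStream.Seek from a time offset.
    // Assumes uncompressed PCM: bytes = seconds * sampleRate * bytesPerSample * channels.
    public static int SecondsToByteOffset(int seconds, int sampleRate, int bytesPerSample, int channels)
    {
        return seconds * sampleRate * bytesPerSample * channels;
    }

    static void Main()
    {
        // 4 seconds of 16-bit stereo audio at 16000 Hz
        Console.WriteLine(SecondsToByteOffset(4, 16000, 2, 2)); // 256000 bytes
    }
}
```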

Monday, February 19, 2007

C# starter code for SAPI 5.3 speech recognition from microphone under WPF

So despite the indications here and here to the contrary, it appears that it is possible to successfully create a SAPI 5.3 application to run on Windows XP. Here are the steps for getting a bare-bones C# program (using .NET 3.0 Windows Presentation Foundation to run as a Window application) up and running to recognize speech input from the microphone using SAPI 5.3.

Caveat: When using SAPI 5.3 on Windows XP in a WPF Window application, it appears that you cannot use the more complex SpeechRecognitionEngine class, but have to resort to using the SpeechRecognizer class. One of the limitations that this entails is that you cannot specify an audio file as an input into the recognizer.
  1. Install .NET Framework 3.0 from here.
  2. (Optional) If you want the additional tools for Visual Studio 2005 to facilitate development using .NET Framework 3.0, install the following two components:
    1. Microsoft® Windows® Software Development Kit for Windows Vista™ and .NET Framework 3.0 Runtime Components (only the Documentation is needed for installing the next component)
    2. Visual Studio 2005 extensions for .NET Framework 3.0 (WCF & WPF), November 2006 CTP (provides support for visually editing XAML files)
  3. Create a new C# Windows Application (WPF) in Visual Studio 2005.
  4. In the Solution Explorer, right click on References under your project node, and select Add Reference....
  5. In the .NET tab, select System.Speech (verify it's version 3.0.0.0), and click OK.
  6. Double click Window1.xaml (if you didn't do Step 2 above, then you'll have to right click on it, select Open with..., and choose XML editor), and add the following snippet inside the <Grid> </Grid> element:
    <ScrollViewer>
    <TextBox x:Name="result_textBox" TextWrapping="WrapWithOverflow"
    ScrollViewer.CanContentScroll="True"></TextBox>
    </ScrollViewer>
  7. Change your Window1.xaml.cs code to the following:
using System;
using System.Speech;
using System.Speech.Recognition;

namespace SimpleSAPI_5_3
{
public partial class Window1 : System.Windows.Window
{
// whether to use the command and control grammar or the dictation grammar
bool commandAndControl = false;
SpeechRecognizer _speechRecognizer;

public Window1()
{
InitializeComponent();

// set up the recognizer
_speechRecognizer = new SpeechRecognizer();
_speechRecognizer.Enabled = false;
_speechRecognizer.SpeechRecognized +=
new EventHandler<SpeechRecognizedEventArgs>(_speechRecognizer_SpeechRecognized);

// set up the dictation grammar
DictationGrammar dictationGrammar = new DictationGrammar();
dictationGrammar.Name = "dictation";
dictationGrammar.Enabled = true;

// set up the command and control grammar
Grammar commandGrammar = new Grammar(@"grammar.xml");
commandGrammar.Name = "main command grammar";
commandGrammar.Enabled = true;

// activate one of the grammars if we don't want both at the same time
if (commandAndControl)
_speechRecognizer.LoadGrammar(commandGrammar);
else
_speechRecognizer.LoadGrammar(dictationGrammar);
}

protected override void OnClosing(System.ComponentModel.CancelEventArgs e)
{
_speechRecognizer.UnloadAllGrammars();
_speechRecognizer.Dispose();
base.OnClosing(e);
}

void _speechRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
result_textBox.AppendText(e.Result.Text + "\n");
}
}
}
Note that the dictation grammar and the command-and-control grammar can both be active at the same time.

SAPI 5.3 uses W3C's Speech Recognition Grammar Specification (SRGS) Version 1.0 for its grammar files (see here for the grammar specification). To use a command-and-control grammar, set commandAndControl to true, and save the following file as grammar.xml in the same directory as the executable:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
"http://www.w3.org/TR/speech-grammar/grammar.dtd">
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/06/grammar
http://www.w3.org/TR/speech-grammar/grammar.xsd"
xml:lang="en-US" version="1.0" root="command">
<rule id="command" scope="public">
<one-of>
<item>selected</item>
<item>interface</item>
<item>default</item>
</one-of>
</rule>
</grammar>
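If you prefer to validate or manipulate the grammar in code rather than load the raw XML path, System.Speech can also wrap an SrgsDocument (a sketch, assuming the grammar.xml file from this post sits next to the executable):

```csharp
using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

// Load the SRGS XML into an SrgsDocument first (which validates it
// against the SRGS 1.0 schema), then wrap it in a Grammar.
SrgsDocument srgs = new SrgsDocument(@"grammar.xml");
Grammar commandGrammar = new Grammar(srgs);
commandGrammar.Name = "main command grammar";
commandGrammar.Enabled = true;
```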

Sunday, February 18, 2007

Working with grammars and recognition contexts

Here are several other places in the SAPI 5.1 SDK documentation that clarify the relationship between grammars and recognition contexts and their appropriate usage:

Automation -> Sp[InProc/Shared]RecoContext
  • "An application may have several recognition contexts open at the same time, each controlling a different part of the application."
  • "Applications may have more than one recognition context. In fact, it is recommended to have as many as makes sense."
Automation -> SpSharedRecoContext -> CreateGrammar
  • "A recognizer may have more than one grammar associated with it although they are usually limited to one each of two types: dictation and context free grammar (CFG)."
Automation -> ISpeechRecoGrammar
  • "Each ISpRecoGrammar object can contain both a context-free grammar (CFG) and a dictation grammar simultaneously."

Thursday, February 15, 2007

C# starter code for SAPI 5.1 speech recognition from audio file

Here's how you would change the code in the previous post to recognize speech input from an audio file instead of the microphone.

First, create a wav file with some utterance using audio recording software. Most formats should be supported, but a good setting is a 44,100 Hz, 16-bit mono wav file.

Simply replace the part of the code between /****** BEGIN: set up recognition context *****/ and /****** END: set up recognition context *****/ with the following snippet:

Caveat: Recognizing from an audio file only works with the in-proc recognizer.
/****** BEGIN: set up recognition context *****/
result_textBox.AppendText("File mode\n");

// create an audio file stream
ISpeechFileStream sfs = new SpFileStreamClass();
sfs.Open(@"recording.wav", SpeechStreamFileMode.SSFMOpenForRead, false);

// create the recognition context
recoContext = new SpeechLib.SpInProcRecoContext();
recoContext.Recognizer.AudioInputStream = sfs;
((SpInProcRecoContext)recoContext).Recognition +=
new _ISpeechRecoContextEvents_RecognitionEventHandler(RecoContext_Recognition);
/****** END: set up recognition context *****/
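When recognizing from a file, the stream eventually runs out. One way to detect that is the recognition context's EndStream event (a sketch; I'm assuming the interop delegate name follows the same `_ISpeechRecoContextEvents_*EventHandler` pattern as the Recognition event above, so verify against your generated interop assembly):

```csharp
// Hook the EndStream event so we know when the wav file has been
// fully consumed and the file stream can be closed.
((SpInProcRecoContext)recoContext).EndStream +=
new _ISpeechRecoContextEvents_EndStreamEventHandler(RecoContext_EndStream);

void RecoContext_EndStream(int StreamNumber, object StreamPosition, bool StreamReleased)
{
    result_textBox.AppendText("End of audio file reached\n");
    sfs.Close();
}
```

Note that sfs would need to be promoted to a field so the handler can reach it.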

Wednesday, February 14, 2007

C# starter code for SAPI 5.1 speech recognition from microphone

Here are the steps for getting a bare-bones C# program up and running to recognize speech input from the microphone using SAPI 5.1.

Caveat: SAPI 5.1 does not work under a C# console application, due to the Automation API's dependence on Windows' Message Pump, so you have to create a Form-based application.
  1. Create a new C# Windows Application in Visual Studio 2005.
  2. In the Solution Explorer, right click on References under your project node, and select Add Reference....
  3. Click on the COM tab, select Microsoft Speech Object Library (verify it's version 5.0), and click OK.
  4. Double click Form1.cs, and add a TextBox control, set its Multiline behavior property to True, change the Name design property to "result_textBox", and resize the control on the Form to an appropriate size (this will be where the recognized text will be output).
  5. Change your Form1.cs code to the following:
using System.Windows.Forms;
using SpeechLib;

namespace SimpleSAPI
{
public partial class Form1 : Form
{
// whether to use the command and control grammar or the dictation grammar
bool commandAndControl = false;
ISpeechRecoContext recoContext;
ISpeechRecoGrammar grammar;

public Form1()
{
InitializeComponent();
}

protected override void OnLoad(System.EventArgs e)
{
/****** BEGIN: set up recognition context *****/
result_textBox.AppendText("Dictation mode\n");

// create the recognition context
recoContext = new SpeechLib.SpSharedRecoContext();
((SpSharedRecoContext)recoContext).Recognition +=
new _ISpeechRecoContextEvents_RecognitionEventHandler(RecoContext_Recognition);
/****** END: set up recognition context *****/

// set up the grammar
grammar = recoContext.CreateGrammar(0);

// set up the dictation grammar
grammar.DictationLoad("", SpeechLoadOption.SLOStatic);
grammar.DictationSetState(SpeechRuleState.SGDSInactive);

// load the command and control grammar
grammar.CmdLoadFromFile(@"grammar.xml", SpeechLoadOption.SLOStatic);
grammar.CmdSetRuleIdState(0, SpeechRuleState.SGDSInactive);

// activate one of the grammars if we don't want both at the same time
if (commandAndControl)
grammar.CmdSetRuleIdState(0, SpeechRuleState.SGDSActive);
else
grammar.DictationSetState(SpeechRuleState.SGDSActive);
}

protected override void OnClosing(System.ComponentModel.CancelEventArgs e)
{
recoContext.State = SpeechRecoContextState.SRCS_Disabled;
base.OnClosing(e);
}

void RecoContext_Recognition(int StreamNumber, object StreamPosition,
SpeechRecognitionType RecognitionType, ISpeechRecoResult Result)
{
result_textBox.AppendText(Result.PhraseInfo.GetText(0, -1, true) + "\n");
}
}
}
Note that the dictation grammar and the command-and-control grammar can both be active at the same time within the same speech recognition context.

To use a command-and-control grammar, set commandAndControl to true, and save the following file as grammar.xml in the same directory as the executable:
<GRAMMAR LANGID="409">
<RULE NAME="toplevel" TOPLEVEL="ACTIVE">
<L>
<P>selected</P>
<P>interface</P>
<P>default</P>
</L>
</RULE>
</GRAMMAR>

Note that the LANGID should be set to 409, the hexadecimal language ID for U.S. English (0x409 = decimal 1033).
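Instead of activating the rule by numeric ID as the code above does, the automation interface also lets you activate a top-level rule by its NAME attribute via CmdSetRuleState (a sketch, reusing the "toplevel" rule name from the grammar.xml in this post):

```csharp
// Load the command-and-control grammar, then activate the
// "toplevel" rule by name rather than by numeric rule ID.
grammar.CmdLoadFromFile(@"grammar.xml", SpeechLoadOption.SLOStatic);
grammar.CmdSetRuleState("toplevel", SpeechRuleState.SGDSActive);
```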

Tuesday, February 13, 2007

Understanding SAPI 5.1

The SAPI SDK 5.1 comes with a documentation file (or you can download just the documentation from here) in the form of Windows Help, but it's not easy to navigate.

Some good places to start in the Contents tab (after opening Start -> All Programs -> Microsoft Speech SDK 5.1 -> Microsoft Speech SDK 5.1 Help) are:
  • Automation -> Sp[Shared/InProc]Recognizer:
Description of the interfaces to the underlying speech recognition engines and their different types (shared versus in-process).
  • Automation -> Sp[Shared/InProc]RecoContext:
    A nice description of what "Recognition Contexts" are, and how one should create as many of them as appropriate for the application.
  • Automation -> ISpeechPhraseRule -> Code Example:
    Lists the properties that can be queried on a phrase rule that was recognized, including rule name and confidence values.
  • Automation -> ISpeechPhraseElement -> Code Example:
    Lists the properties that can be queried on a phrase element, including confidence values.
  • Automation -> ISpeechPhraseProperty -> Confidence:
Example of how confidence values can be extracted along with their corresponding property names.
  • Automation -> ISpeechAlternate:
    A way to get at a list of alternate phrase candidates for dictation mode recognition.
  • Automation -> Sp[Shared/InProc]RecoContext (Events):
    The list of events that the recognition context can receive, and thus the clients can listen for.
  • Application-Level Interfaces -> Grammar Compiler Interfaces -> Text Grammar Format:
    Description of the context-free grammar format used for command and control (as opposed to dictation) recognition.
  • White Papers -> SAPI 5.0 SR Properties White Paper:
    The list of recognition engine properties that can be queried and set using the SetPropertyNumber method of Sp[Shared/InProc]Recognizer class, including the confidence thresholds.
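For example, the confidence thresholds can be adjusted through SetPropertyNumber (a sketch; the property name "CFGConfidenceRejectionThreshold" and its 0-100 range are my reading of the SR Properties White Paper, so verify them against your engine):

```csharp
using SpeechLib;

// Raise the CFG confidence rejection threshold (0-100) so that
// low-confidence command matches are rejected by the engine.
// SetPropertyNumber returns false if the engine doesn't support the property.
ISpeechRecognizer recognizer = new SpSharedRecognizerClass();
bool supported = recognizer.SetPropertyNumber("CFGConfidenceRejectionThreshold", 80);
```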

Tuesday, February 6, 2007

Speech Recognition Profile Manager Tool

Once you have gone through the trouble of training a speech profile (Control Panel -> Speech -> Train Profile), you can save it and load it on another machine using the Speech Recognition Profile Manager Tool. It is also useful if you are running user studies and want to back up your training sets for analysis.

http://www.microsoft.com/downloads/details.aspx?FamilyID=cd72250f-2e02-430e-8f99-e1acae760564&DisplayLang=en

Where to seek help

A good place to look for help from the developer community is the microsoft.public.speech_tech.sdk newsgroup:

http://groups.google.com/group/microsoft.public.speech_tech.sdk/topics

SAPI 5.1 versus 5.3

How to check which version of SAPI you have:
Go to C:\Program Files\Common Files\Microsoft Shared\Speech
and right click on sapi.dll and select Properties. Click on the Version tab to see if you have version 5.1.
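The same check can be scripted (a sketch using FileVersionInfo; the path is the one given above and may differ on your machine):

```csharp
using System;
using System.Diagnostics;

// Read the file version of sapi.dll to determine the installed SAPI version.
string sapiPath = @"C:\Program Files\Common Files\Microsoft Shared\Speech\sapi.dll";
FileVersionInfo info = FileVersionInfo.GetVersionInfo(sapiPath);
Console.WriteLine("SAPI version: " + info.FileVersion);
```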

SAPI 5.1 overview
http://www.answers.com/topic/speech-application-programming-interface
Provides a nice overview of the different versions of SAPI.
http://www.microsoft.com/speech/techinfo/apioverview/
Brief overview of SAPI 5.1 and SAPI SDK 5.1 from Microsoft.

SAPI SDK 5.1
http://www.microsoft.com/speech/download/old/sapi5.asp
Download the "previous" version (before the current SAPI 5.3) of the SDK here.

SAPI 5.1: clarifications thereof
http://blindprogramming.com/pipermail/programming_blindprogramming.com/2006-February/004794.html
E-mail thread that discusses how to use SAPI 5.1. Key takeaways:
  • Can't use SAPI 5.3 on Windows XP.
  • You can use SAPI 5.1 with managed code in VS 2005 by adding it as a COM reference, and VS 2005 will automatically create the COM interop code.
The Windows Vista Developer Story: Speech
http://msdn2.microsoft.com/en-us/library/aa480207.aspx
Describes the features of SAPI 5.3.

All the Cool Developers use Speech APIs
http://blogs.msdn.com/chuckop/archive/2006/11/24/microsoft-speech-api-sdk.aspx
Describes the difference between SAPI 5.1 and 5.3.