NtKinect: Kinect V2 C++ Programming with OpenCV on Windows10

How to recognize speech with Kinect V2


2016.07.18: created
2016.07.31: revised
2016.09.13: revised
2019.08.31: revised

After reading this article, we recommend that you read "NtKinect: How to make Kinect V2 speech recognition as DLL and using it from Unity" and try using Kinect V2 speech recognition from Unity.


Prerequisite knowledge


[Notice] (added Aug/31/2019) With Visual Studio 2019, and with later versions of Visual Studio 2017, the following error occurs and compilation fails.

Error  C2440  cannot convert from 'default argument': 'const wchar_t [1]' to 'BSTR'.  KinectV2  C:\Program Files\Microsoft SDKs\Speech\v11.0\Include\sapi.h  12519
This happens because newer compilers handle string literals more strictly. The code compiles normally with the /Zc:strictStrings- option, which disables strict string literal type conversion.
How to set the compile option
Solution Explorer -> right-click on the project name (KinectV2) -> Properties -> Configuration Properties -> C/C++ -> Command Line -> Additional Options -> /Zc:strictStrings-
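
To see why the stricter compilers reject the header, note that BSTR is a pointer to non-const wide characters, while a wide string literal is an array of const wide characters. The following is a minimal sketch of that pattern, not code from sapi.h. The names WideStr and lengthOf are hypothetical stand-ins introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical stand-in for BSTR, which is a pointer to non-const wide characters.
typedef wchar_t* WideStr;

// Under strict string handling, a declaration like the one in this comment fails,
// because the literal L"" has type 'const wchar_t [1]' and cannot become 'wchar_t*':
//   std::size_t lengthOf(WideStr s = L"");

// A conforming variant takes a pointer to const, so the literal converts cleanly.
std::size_t lengthOf(const wchar_t* s = L"") {
  assert(s != nullptr);
  return std::wcslen(s);
}
```

Since sapi.h belongs to the SDK and should not be edited, relaxing the check with /Zc:strictStrings- as described above is the practical route for this project.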


Prerequisite items

The following settings are required to perform speech recognition with Kinect V2. In particular, the program often fails to work properly because a different "sapi.h" located elsewhere on the system gets included, so pay close attention to the Visual Studio project settings.


Recognizing Speech

If you define the USE_SPEECH constant before including NtKinect.h, NtKinect's functions and variables for speech recognition become available.

  #define USE_SPEECH
  #include "NtKinect.h"
  ...
NtKinect

NtKinect's functions for Speech Recognition

return type / function name / description

void setSpeechLang(string& lang, wstring& grxml)
    Specify the language lang and the grammar file grxml that lists the words to recognize.
    If this function is not called, the default values are "ja-JP" and L"Grammar_jaJP.grxml", respectively.
void startSpeech()
    Start speech recognition.
void stopSpeech()
    Stop speech recognition.
bool setSpeech()
    Call after startSpeech(). This function tries speech recognition on the most recent audio data and sets the values of the following member variables.
        type / variable name / description
        bool recognizedSpeech: Flag indicating whether speech recognition succeeded.
        wstring speechTag: Tag of the speech recognition result.
        wstring speechItem: Item of the speech recognition result.
        float speechConfidence: Confidence of the speech recognition result.
NtKinect

NtKinect's member variables for Speech Recognition

type / variable name / description

bool recognizedSpeech
    Flag indicating whether speech recognition succeeded.
    It is true if speechConfidence is larger than confidenceThreshold.
wstring speechTag
    When the speech is recognized, the "tag" of the word is set.
    This "tag" is the value specified by the tag field in the grxml file.
wstring speechItem
    When the speech is recognized, the "item" of the word is set.
    This "item" is the value specified by the item field in the grxml file.
float speechConfidence
    Confidence of the recognition result. When this value is larger than confidenceThreshold, the speech is judged to be recognized.
float confidenceThreshold
    Threshold for accepting a recognition result. You can set it by assigning directly to the variable. The default value is 0.3.
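
The relation between speechConfidence, confidenceThreshold, and recognizedSpeech can be sketched as follows. This is only an illustration of the rule described above, not NtKinect's actual source. The struct SpeechResult and the function judge are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the speech-related members described above.
struct SpeechResult {
  bool recognizedSpeech = false;
  std::wstring speechTag;
  float speechConfidence = 0.0f;
};

// Accept a candidate only when its confidence is larger than the threshold.
SpeechResult judge(const std::wstring& tag, float confidence,
                   float confidenceThreshold = 0.3f) {
  assert(confidenceThreshold >= 0.0f);
  SpeechResult r;
  r.speechConfidence = confidence;
  r.recognizedSpeech = confidence > confidenceThreshold;
  if (r.recognizedSpeech) r.speechTag = tag;   // the tag is kept only on success
  return r;
}
```

With the default threshold of 0.3, a candidate with confidence 0.2 is rejected. Lowering the threshold to 0.1 accepts the same candidate, which is why a lower threshold tends to let more misrecognitions through.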

How to write program

  1. Install "Speech Platform SDK v11", the speech recognition SDK used with Kinect V2.
  2. http://www.microsoft.com/en-us/download/details.aspx?id=27226
    64bit version
      x64_MicrosoftSpeechPlatformSDK\MicrosoftSpeechPlatformSDK.msi
    



    Click the downloaded msi file and install it.


  3. In addition, install the Speech SDK Runtime.
  4. Run the following msi file, which is placed on disk when the Speech SDK is installed.

    C:\Program Files\Microsoft SDKs\Speech\v11.0\Redist\SpeechPlatformRuntime.msi
    




  5. Download and install the acoustic model prepared for each language.
  6. http://www.microsoft.com/en-us/download/details.aspx?id=43662
       MSKinectLangPack_enUS.msi  English (US)
       MSKinectLangPack_jaJP.msi  Japanese (JP)
    






    Click on the two downloaded msi files and execute them.

  7. Restart Windows after installing the Speech SDK itself, the Speech SDK Runtime, and the two KinectLangPack acoustic models, "enUS" and "jaJP".
  8. A restart may not be necessary, but do it just in case.

  9. Start from the Visual Studio project KinectV2_audio.zip.
  10. This project comes from the article "NtKinect: How to get and record audio with Kinect V2".
  11. For each language, prepare a file describing the words you want to recognize.
  12. This time we recognize Japanese, but add both files to the project so that you can switch to English later.

    File for enUS: Grammar_enUS.grxml
    File for jaJP: Grammar_jaJP.grxml

    First, place the grxml file in the project's source folder, in this case "KinectV2_speech/KinectV2/". Then, right-click on "Resource file" in "Solution Explorer" and select "Existing item" from "Add".




    When you edit these files, save them in UTF-8 format. For the grammar specification, see https://www.w3.org/TR/speech-grammar/ .

  13. Add KinectAudioStream.h and KinectAudioStream.cpp to the project.
  14. First, copy KinectAudioStream.h and KinectAudioStream.cpp to the project's source folder, in this case "KinectV2_speech/KinectV2/". Then, right-click on "Resource file" in "Solution Explorer" and select "Existing item" from "Add".







  15. Change the project settings.
  16. Right-click on "Project name (here KinectV2)" from "Solution Explorer" and select "Properties" from the menu.




    Add the Speech SDK include path $(ProgramW6432)\Microsoft SDKs\Speech\v11.0\Include; at the beginning of the project's include directories.
    "Configuration Properties" -> "VC++ Directories" -> "Include Directories" -> Add $(ProgramW6432)\Microsoft SDKs\Speech\v11.0\Include; at the beginning of the value.

    Add the Speech SDK library path $(ProgramW6432)\Microsoft SDKs\Speech\v11.0\Lib; at the beginning of the project's library directories.
    "Configuration Properties" -> "VC++ Directories" -> "Library Directories" -> Add $(ProgramW6432)\Microsoft SDKs\Speech\v11.0\Lib; at the beginning of the value.







    In the project properties, add the sapi.lib library.
    "Configuration Properties" -> "Linker" -> "Input" -> "Additional Dependencies" -> Add sapi.lib




  17. Change the contents of main.cpp as follows.
  18. Define the USE_SPEECH constant before including NtKinect.h.

    Since the speech recognition code runs in multiple threads, CoInitializeEx(NULL, COINIT_MULTITHREADED) is called at the beginning of the program and CoUninitialize() is called at the end.

    Microsoft's Speech SDK returns recognition results as wchar_t strings, so NtKinect holds them as wstring. A wstring cannot be written to cout, so this example writes to wcout. Be careful: wcout does not work unless you set the locale.

        std::wcout.imbue(std::locale(""));   // set locale
    

    Note that a wstring literal is prefixed with 'L', as in L"EXIT". Also, avoid using cout and wcout at the same time.

    If the program does heavy processing with Kinect, audio capture becomes intermittent and speech is recognized poorly. Therefore this example omits processing of the other Kinect sensors and the OpenCV window display.

    To read keyboard input from the console in real time, we use the _kbhit() and _getch() functions from conio.h.

    main.cpp
    #include <iostream>
    #include <sstream>
    
    #include <conio.h>
    
    #define USE_SPEECH
    #include "NtKinect.h"
    
    using namespace std;
    
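    void doJob() {
      NtKinect kinect;
      kinect.startSpeech();                       // start speech recognition
      std::wcout.imbue(std::locale(""));          // set locale so that wcout can print wstring
      while (1) {
        kinect.setSpeech();                       // update the recognition result members
        if (kinect.recognizedSpeech) {
          wcout << kinect.speechTag << L" " << kinect.speechItem << endl;
        }
        if (kinect.speechTag == L"EXIT") break;   // a word tagged EXIT was spoken
        if (_kbhit() && _getch() == 'q') break;   // 'q' was pressed on the console
      }
      kinect.stopSpeech();                        // stop speech recognition
    }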
    
    int main(int argc, char** argv) {
      try {
        ERROR_CHECK(CoInitializeEx(NULL, COINIT_MULTITHREADED));
        doJob();
        CoUninitialize();
      } catch (exception &ex) {
        cout << ex.what() << endl;
        string s;
        cin >> s;
      }
      return 0;
    }
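
The locale handling used in the sample can be isolated as in the following sketch. It assumes nothing about NtKinect. The helper names useUserLocale, formatResult, and printResult are hypothetical and were chosen for this illustration. The sketch also guards against std::locale("") throwing, which can happen when the environment names a locale the system does not provide.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Try to apply the user's locale to a wide stream and report whether it worked.
// std::locale("") throws std::runtime_error when the named locale is unavailable.
bool useUserLocale(std::wostream& out) {
  try {
    out.imbue(std::locale(""));
    return true;
  } catch (const std::runtime_error&) {
    return false;   // keep the stream's current locale
  }
}

// Build one result line in the same "tag item" form as the sample program.
std::wstring formatResult(const std::wstring& tag, const std::wstring& item) {
  return tag + L" " + item;
}

// Write the formatted result followed by a newline.
void printResult(std::wostream& out, const std::wstring& tag,
                 const std::wstring& item) {
  out << formatResult(tag, item) << L"\n";
}
```

In a program like the sample, useUserLocale(std::wcout) would be called once before the loop, and printResult(std::wcout, kinect.speechTag, kinect.speechItem) inside it.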
    
  19. When you run the program, speech recognition starts on the console. No image window is displayed in this example. Exit with the 'q' key.
  20. The result of speech recognition is output to the console. The current grxml registers 8 Japanese words: "AKA", "AKAIRO", "AO", "AOIRO", "MIDORI", "MIDORIIRO", "SHUURYOU", and "OWARI". They mean "red", "red color", "blue", "blue color", "green", "green color", "end", and "end", respectively. When the word "SHUURYOU" or "OWARI" is recognized, the program ends.




    Grammar_jaJP.grxml
    <?xml version="1.0" encoding="utf-8" ?>
    <grammar version="1.0" xml:lang="ja-JP" root="rootRule" tag-format="semantics/1.0-literals" xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="rootRule">
        <one-of>
          <item>
            <tag>RED</tag>
            <one-of>
              <item> 赤 </item>
              <item> 赤色 </item>
            </one-of>
          </item>
          <item>
            <tag>GREEN</tag>
            <one-of>
              <item> 緑 </item>
              <item> 緑色 </item>
            </one-of>
          </item>
          <item>
            <tag>BLUE</tag>
            <one-of>
              <item> 青 </item>
              <item> 青色 </item>
            </one-of>
          </item>
          <item>
            <tag>EXIT</tag>
            <one-of>
              <item> 終わり </item>
              <item> 終了 </item>
            </one-of>
          </item>
        </one-of>
      </rule>
    </grammar>
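
Because several spoken items share one tag in this grammar, an application usually only needs to branch on the tag. The following is a minimal sketch of such a dispatch. The enum Action and the function actionFor are hypothetical names for this example, and the tag values are the ones defined in the grammar above.

```cpp
#include <string>

// Actions an application might take (hypothetical names).
enum class Action { ShowRed, ShowGreen, ShowBlue, Quit, Ignore };

// Map a recognized tag from the grammar to an action.
// Only the tag is inspected, so every item under one tag behaves the same way.
Action actionFor(const std::wstring& speechTag) {
  if (speechTag == L"RED")   return Action::ShowRed;
  if (speechTag == L"GREEN") return Action::ShowGreen;
  if (speechTag == L"BLUE")  return Action::ShowBlue;
  if (speechTag == L"EXIT")  return Action::Quit;
  return Action::Ignore;   // unknown or empty tag
}
```

Branching on the tag rather than the item also means the same code keeps working when the grammar file for another language uses the same tags.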
    
  21. The sample project for this article is KinectV2_speech.zip.
  22. Since the above zip file may not include the latest "NtKinect.h", download the latest version and replace the old one with it.

  23. Let's try it.
  24. The threshold is currently 0.3. Try lowering it to 0.1 and see how misrecognition increases.

    Let's change the recognition language to English.

        NtKinect kinect;
        kinect.setSpeechLang("en-US", L"Grammar_enUS.grxml");   // <-- Add this line to recognize English.
        kinect.startSpeech();
    
    Grammar_enUS.grxml
    <?xml version="1.0" encoding="utf-8" ?>
    <grammar version="1.0" xml:lang="en-US" root="rootRule" tag-format="semantics/1.0-literals" xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="rootRule">
        <one-of>
          <item>
            <tag>RED</tag>
            <one-of>
              <item> Red </item>
            </one-of>
          </item>
          <item>
            <tag>GREEN</tag>
            <one-of>
              <item> Green </item>
            </one-of>
          </item>
          <item>
            <tag>BLUE</tag>
            <one-of>
              <item> Blue </item>
            </one-of>
          </item>
          <item>
            <tag>EXIT</tag>
            <one-of>
              <item> Exit </item>
              <item> Quit </item>
              <item> Stop </item>
            </one-of>
          </item>
        </one-of>
      </rule>
    </grammar>
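
The two grammar files in this article follow the naming pattern "Grammar_" plus the language code without its hyphen. If you want to switch languages from a single setting, a small helper like the following sketch can derive the file name. The function grammarFileFor is a hypothetical name introduced here, and the sketch assumes an ASCII language code such as "en-US" or "ja-JP".

```cpp
#include <cassert>
#include <string>

// Build the grammar file name used in this article from a language code,
// for example "en-US" -> L"Grammar_enUS.grxml".
std::wstring grammarFileFor(const std::string& lang) {
  assert(!lang.empty());
  std::wstring name = L"Grammar_";
  for (char c : lang) {
    if (c != '-') name += static_cast<wchar_t>(c);   // drop the hyphen
  }
  name += L".grxml";
  return name;
}
```

Since setSpeechLang is declared with reference parameters, it is probably safest to store the language code and the derived file name in named variables before passing them to it.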
    
  25. Let's use Kinect V2 speech recognition with Unity. For how to do it, see "NtKinect: How to make Kinect V2 speech recognition as DLL and using it from Unity".


http://nw.tsuda.ac.jp/