Google's speech recognition API is the Cloud Speech API.
First, apply for the "Free Trial" of the Cloud Speech API on Google Cloud Platform and set it up so that the API can be used.
Following "Quick Start: Learning in 5 minutes" in the "Google Cloud Speech API Document", send the audio described in "sync-request.json" with "curl" and confirm the procedure by which the recognized speech is returned.
sync-request.json |
|
Send HTTP "Post" request with curl |
|
Obtain an access token and use it to call the Google Speech API over HTTP. Note that if you save the access token by redirecting the output to a file, as in the following command, the file will be encoded as little-endian UTF-16.
gcloud auth application-default print-access-token > token-file.txt
To prevent your access token from leaking to others, it is recommended that the file be placed outside the Visual Studio project.
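Because the redirected token file may be little-endian UTF-16, a program that reads it has to turn it back into a plain ASCII string before putting it in an HTTP header. The following is a minimal sketch of one way to do that, not the code used in NtGoogleSpeech.h. The function names decodeToken and readTokenFile are hypothetical, and the sketch assumes that the token contains only ASCII characters.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Decode the raw bytes of a token file into a plain ASCII string.
// If the data starts with the UTF-16 little-endian byte order mark (FF FE),
// only the low byte of each 16-bit unit is kept; otherwise the bytes are
// used as they are. Assumption: the token itself is pure ASCII.
std::string decodeToken(const std::vector<unsigned char>& bytes) {
    std::string token;
    bool utf16le = bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE;
    if (utf16le) {
        for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
            if (bytes[i + 1] == 0) token.push_back(static_cast<char>(bytes[i]));
        }
    } else {
        token.assign(bytes.begin(), bytes.end());
    }
    // Strip the trailing line break (and other trailing whitespace).
    while (!token.empty() && (token.back() == '\r' || token.back() == '\n' ||
                              token.back() == ' '  || token.back() == '\t')) {
        token.pop_back();
    }
    return token;
}

// Read a whole file in binary mode and decode it with decodeToken().
// Returns an empty string if the file cannot be opened.
std::string readTokenFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
    return decodeToken(bytes);
}
```

Opening the file in binary mode matters here: in text mode, the zero bytes and line-break translation could corrupt the UTF-16 data.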
[Caution]
You may get an error "Default Credential Authentification ..." when you get the access token.
This happens because your Google Application Default Credentials are not set up correctly.
The easiest way to get rid of this error is to create a Default_Credentials file by running the following command. When you run the command, a web browser is launched. Since Microsoft Edge is too slow for use with Google Cloud Platform, we recommend changing the "Default Application" for "Web Browser" from Microsoft Edge to Google Chrome.
gcloud auth application-default login
This explanation is not in Google's "Quick Start" (as of Dec/25/2017), so many people seem to stumble here.
[Caution 2]
The same "Default Credential Authentification ..." error can also be resolved in another way: set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the downloaded service account JSON file (ynitta-XXXXX-XXXXXXXXX.json in the above example).
The sound acquired by Kinect V2 is saved as a WAVE file in the following format.
Property Name | Property Value |
---|---|
Format | WAVE_FORMAT_IEEE_FLOAT |
nChannels | 1 |
nSamplesPerSec | 16000 |
wBitsPerSample | 32 |
The only lossless audio formats that can be used in the Google Speech API are FLAC and LINEAR16 (as of December/20/2017), so we need to convert our WAVE_FORMAT_IEEE_FLOAT audio data into one of these formats.
LINEAR16 is the "WAVE_FORMAT_PCM" format, in which each audio sample is expressed as a 16-bit signed integer on a linear scale, so it is easy to convert WAVE_FORMAT_IEEE_FLOAT data into it.
Each sample in the WAVE_FORMAT_IEEE_FLOAT format is a floating point number between -1.0 and 1.0. Read each sample as a 4-byte (32-bit) floating point number, multiply it by 0x7fff = 32767, and convert the result to INT16.
Conversion of audio format (32bit float WAVE --> 16bit int WAVE) |
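FLOAT *p = (FLOAT *) pointer;   // read position: 32-bit float samples
INT16 *q = (INT16 *) pointer;   // write position: 16-bit samples in the same buffer
for (int i = 0; i < size / 4; i++) {   // size is in bytes, 4 bytes per float sample
    *q++ = (INT16) (32767 * (*p++));
}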
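The same conversion can also be written with standard C++ types only. The following sketch is an illustration, not the code of the project: the function name floatToLinear16 is hypothetical, and it adds a clamp to the range -1.0 to 1.0, which the snippet above does not do, so that a sample slightly outside the range cannot overflow the 16-bit integer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert 32-bit float samples (nominally in [-1.0, 1.0]) to 16-bit
// signed integer samples (LINEAR16). Values outside the range are clamped.
std::vector<std::int16_t> floatToLinear16(const std::vector<float>& samples) {
    std::vector<std::int16_t> out;
    out.reserve(samples.size());
    for (float s : samples) {
        float clamped = std::max(-1.0f, std::min(1.0f, s));
        // The cast truncates toward zero, as in the original snippet.
        out.push_back(static_cast<std::int16_t>(32767.0f * clamped));
    }
    return out;
}
```

Unlike the in-place version, this one writes to a new buffer, which uses more memory but leaves the original float data untouched.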
Note that a WAVE file saved by the "KinectV2_audio" project has 46 bytes of file format information at the beginning of the file; the audio data starts after that.
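Instead of hard-coding the 46-byte offset, a program can walk the RIFF chunks and look for the "data" chunk. The following is a minimal sketch under the assumption that the file is a standard RIFF/WAVE file; the function name findWaveDataOffset is hypothetical. A 46-byte header would be consistent with a 12-byte RIFF header, an 8-byte "fmt " chunk header with an 18-byte body, and an 8-byte "data" chunk header.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Read a 32-bit little-endian unsigned integer from a byte buffer.
static std::uint32_t readLE32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])        |
           (static_cast<std::uint32_t>(p[1]) << 8)  |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Return the byte offset of the first audio sample in a RIFF/WAVE buffer,
// or 0 if the buffer is not a WAVE file or has no "data" chunk.
std::size_t findWaveDataOffset(const std::vector<unsigned char>& wav) {
    if (wav.size() < 12 ||
        std::memcmp(wav.data(), "RIFF", 4) != 0 ||
        std::memcmp(wav.data() + 8, "WAVE", 4) != 0) {
        return 0;
    }
    std::size_t pos = 12;  // first chunk starts right after "RIFF<size>WAVE"
    while (pos + 8 <= wav.size()) {
        std::uint32_t chunkSize = readLE32(wav.data() + pos + 4);
        if (std::memcmp(wav.data() + pos, "data", 4) == 0) {
            return pos + 8;  // samples follow the 8-byte chunk header
        }
        // Skip this chunk; chunk bodies are padded to an even length.
        pos += 8 + static_cast<std::size_t>(chunkSize) + (chunkSize & 1u);
    }
    return 0;
}
```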
The C++ REST SDK (code name "Casablanca") cannot be obtained through NuGet in Visual Studio 2017. Therefore, we use WinHttp to access the web server this time.
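Whichever HTTP library is used, the request body is a JSON document that carries the audio as a Base64 string. The following sketch shows how such a body could be assembled in portable C++. It is not the code of NtGoogleSpeech.h: the function names base64Encode and buildRecognizeBody are hypothetical, and the field names (config.encoding, config.sampleRateHertz, config.languageCode, audio.content) are assumed from the v1 REST API, so check them against the API version you actually call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Standard Base64 encoding with '=' padding.
std::string base64Encode(const std::vector<unsigned char>& data) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < data.size(); i += 3) {
        std::size_t remaining = data.size() - i;
        // Pack up to three bytes into a 24-bit group.
        unsigned int n = static_cast<unsigned int>(data[i]) << 16;
        if (remaining > 1) n |= static_cast<unsigned int>(data[i + 1]) << 8;
        if (remaining > 2) n |= static_cast<unsigned int>(data[i + 2]);
        out.push_back(table[(n >> 18) & 63]);
        out.push_back(table[(n >> 12) & 63]);
        out.push_back(remaining > 1 ? table[(n >> 6) & 63] : '=');
        out.push_back(remaining > 2 ? table[n & 63] : '=');
    }
    return out;
}

// Build a JSON request body for 16 kHz LINEAR16 audio.
// Assumption: languageCode is a plain tag such as "en-US" or "ja-JP"
// that needs no JSON escaping.
std::string buildRecognizeBody(const std::vector<unsigned char>& pcm,
                               const std::string& languageCode) {
    return std::string("{\"config\":{\"encoding\":\"LINEAR16\",")
         + "\"sampleRateHertz\":16000,\"languageCode\":\"" + languageCode
         + "\"},\"audio\":{\"content\":\"" + base64Encode(pcm) + "\"}}";
}
```

The resulting string would be sent as the body of the HTTP POST request, with the access token in the Authorization header.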
The JSON data returned by the Google Speech API as the speech recognition result needs to be parsed separately. In this example, we do not parse the JSON data, for simplicity of explanation.
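If you only need the recognized text, a simple string search may be enough as a first step. The following sketch pulls out the values of every "transcript" field from the response text. It is not a JSON parser and is not part of this example's code: the function name extractTranscripts is hypothetical, the "transcript" field name is assumed from the API's response format, and escape sequences such as \n or \uXXXX are not decoded. For real use, a proper JSON library is the safer choice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect the string values of all "transcript" fields in a JSON text.
// Naive scan: finds the key, then the next ':' and the next '"', and
// copies characters up to the closing quote.
std::vector<std::string> extractTranscripts(const std::string& json) {
    std::vector<std::string> result;
    const std::string key = "\"transcript\"";
    std::size_t pos = 0;
    while ((pos = json.find(key, pos)) != std::string::npos) {
        pos += key.size();
        std::size_t colon = json.find(':', pos);
        if (colon == std::string::npos) break;
        std::size_t open = json.find('"', colon + 1);
        if (open == std::string::npos) break;
        std::string value;
        std::size_t i = open + 1;
        for (; i < json.size() && json[i] != '"'; ++i) {
            // Drop a backslash and keep the character after it literally,
            // so that an escaped quote does not end the value early.
            if (json[i] == '\\' && i + 1 < json.size()) ++i;
            value.push_back(json[i]);
        }
        result.push_back(value);
        pos = i;
    }
    return result;
}
```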
Right-click on the project name "KinectV2" in the Solution Explorer and select "Properties".
In the "Properties" panel, confirm that "Configuration" is "Release", "Active (Release)" or "All", and that "Platform" is "x64", "Active (x64)" or "All".
"Configuration Properties" --> "Linker" --> "Input" --> "Additional Dependencies" --> add "Winhttp.lib".
Download NtGoogleSpeech.h and place it in the folder where the other source files (such as main.cpp) are located. Then add it to the project.
NtGoogleSpeech.h |
|
Change main.cpp to fit your environment.
NtGoogleSpeech gs("C:\\Users\\nitta\\Documents\\GoogleSpeech\\token-file.txt");
You must pass the path of the Google Speech access token file to the constructor of the NtGoogleSpeech class.
main.cpp |
|
Start recording with the 'r' key and stop recording with the 's' key. The file name is created from the time at which recording started (e.g. "2016-07-18_09-16-32.wav"). Use the 'j' key or the 'u' key to send the most recently recorded voice to the Google Speech API and save the recognition result to a file with the extension ".txt". The 'u' key recognizes the speech as English ("en-US") and the 'j' key as Japanese ("ja-JP").
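The file naming rule can be sketched as follows. This is an illustration only, not the code of main.cpp: the function names makeWavFileName and resultFileNameFor are hypothetical. In a real program, the std::tm value would come from the clock at the moment recording starts, for example via std::time and std::localtime.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Build a file name such as "2016-07-18_09-16-32.wav" from a broken-down time.
std::string makeWavFileName(const std::tm& t) {
    char buf[32];
    std::size_t n = std::strftime(buf, sizeof(buf), "%Y-%m-%d_%H-%M-%S", &t);
    return std::string(buf, n) + ".wav";
}

// Derive the name of the result file by replacing ".wav" with ".txt".
// If the name does not end with ".wav", ".txt" is simply appended.
std::string resultFileNameFor(const std::string& wavName) {
    const std::string ext = ".wav";
    if (wavName.size() >= ext.size() &&
        wavName.compare(wavName.size() - ext.size(), ext.size(), ext) == 0) {
        return wavName.substr(0, wavName.size() - ext.size()) + ".txt";
    }
    return wavName + ".txt";
}
```

Using hyphens instead of colons in the time part keeps the name valid on Windows, where ':' is not allowed in file names.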
The character string returned by speech recognition is UTF-8. If Japanese recognition results are displayed as they are, they may appear as garbled characters depending on the environment.
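One way to avoid garbled output is to convert the UTF-8 string to UTF-16 before handing it to an API that expects wide characters. The following is a minimal, portable sketch of such a conversion; the function name utf8ToUtf16 is hypothetical and this code is not part of the example program. It replaces malformed bytes with U+FFFD but does not reject overlong encodings, so treat it as an illustration rather than a full validator. On Windows, the Win32 function MultiByteToWideChar with CP_UTF8 does the same job.

```cpp
#include <cstddef>
#include <string>

// Convert a UTF-8 byte string to UTF-16 code units.
std::u16string utf8ToUtf16(const std::string& s) {
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (c < 0x80)              { cp = c;        extra = 0; }
        else if ((c >> 5) == 0x06) { cp = c & 0x1F; extra = 1; }
        else if ((c >> 4) == 0x0E) { cp = c & 0x0F; extra = 2; }
        else if ((c >> 3) == 0x1E) { cp = c & 0x07; extra = 3; }
        else { out.push_back(static_cast<char16_t>(0xFFFD)); ++i; continue; }

        // Truncated sequence at the end of the input.
        if (i + extra >= s.size()) { out.push_back(static_cast<char16_t>(0xFFFD)); break; }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (!ok) { out.push_back(static_cast<char16_t>(0xFFFD)); ++i; continue; }
        i += extra + 1;

        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```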
Recording starts with the 'r' key and stops with the 's' key. The recording status is displayed as "Recording" or "Stopped" at the upper left of the RGB image.
Use the 'j' key or the 'u' key to recognize the most recently recorded voice as Japanese or English.
The number (e.g. 200, 401) displayed before the recognition result JSON is the HTTP status code of the request.
Recognition result of 2017-12-20_18-57-05.wav as "en-US".
2017-12-20_18-57-05.txt |
|
2017-12-20_18-57-21.txt |
|
In case of expired access token |
|
Since the above zip file may not include the latest "NtKinect.h", download the latest version from here and replace the old one with it.