A Guide to Implementing Speech Recognition in C#

Basic Environment Setup

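There are many approaches to speech recognition, but the System.Speech.Recognition API that ships with Windows has accuracy limitations. As a more practical alternative, this guide uses Whisper.Net, a library built on the Whisper model.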

Install the following packages with the NuGet Package Manager:

Install-Package Whisper.Net -Version 1.9.0
Install-Package Whisper.Net.Runtime -Version 1.9.0
Install-Package NAudio -Version 2.2.1

Implementing the Main Processing

Basic Speech Recognition Flow

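private const string AudioOutput = "audio_temp.wav";
// ConfigLoader is not defined in this article; it is assumed to read settings.conf.
private static ConfigLoader config = new ConfigLoader("settings.conf");
private static bool isRecording;

static void Main()
{
    var recognizer = new AudioRecognizer(config.GetString("modelPath"));

    while (true)
    {
        Console.WriteLine("Press the space bar to start/stop recording");
        if (Console.ReadKey(true).Key != ConsoleKey.Spacebar) continue;

        if (!isRecording)
        {
            // First press: start capturing and wait for the next key press.
            StartRecording();
            isRecording = true;
            continue;
        }

        // Second press: stop capturing, then transcribe the finished file.
        // StopRecording is expected to return only after the WAV file is closed.
        StopRecording();
        isRecording = false;

        var result = recognizer.ConvertToText(AudioOutput).GetAwaiter().GetResult();
        Console.WriteLine(result);
    }
}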
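
The flow above reads the model path through a ConfigLoader that the article does not show. The following is a hypothetical stand-in, not the original helper: it assumes settings.conf holds one key=value pair per line and treats lines starting with # as comments. The second constructor is an addition for illustration so the parser can be tried without a file on disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the ConfigLoader used above; the original is not shown.
// Assumes one "key=value" pair per line, with '#' starting a comment line.
public class ConfigLoader
{
    private readonly Dictionary<string, string> values =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public ConfigLoader(string path) : this(File.ReadLines(path)) { }

    // Takes the lines directly so the parser can be exercised without a file.
    public ConfigLoader(IEnumerable<string> lines)
    {
        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;

            var separator = line.IndexOf('=');
            if (separator <= 0) continue; // skip lines that have no key

            var key = line.Substring(0, separator).Trim();
            var value = line.Substring(separator + 1).Trim();
            values[key] = value;
        }
    }

    public string GetString(string key) =>
        values.TryGetValue(key, out var value)
            ? value
            : throw new KeyNotFoundException($"Missing setting: {key}");
}
```

With this sketch, a settings.conf containing a line such as modelPath=models/ggml-tiny.bin (an example path, not one taken from the article) would satisfy the config.GetString("modelPath") call.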

Audio Capture

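private static WaveInEvent audioInput;
// Signaled once the WAV file has been closed and is safe to read.
private static readonly ManualResetEventSlim recordingFinished = new ManualResetEventSlim(false);

private static void StartRecording()
{
    recordingFinished.Reset();

    // Locals ensure each handler disposes the objects of its own recording.
    var input = new WaveInEvent
    {
        DeviceNumber = 0,
        WaveFormat = new WaveFormat(16000, 1) // 16 kHz, 16-bit, mono
    };

    var fileWriter = new WaveFileWriter(AudioOutput, input.WaveFormat);

    input.DataAvailable += (s, e) =>
    {
        fileWriter.Write(e.Buffer, 0, e.BytesRecorded);
    };

    input.RecordingStopped += (s, e) =>
    {
        // Disposing the writer finalizes the WAV header.
        fileWriter.Dispose();
        input.Dispose();
        recordingFinished.Set();
    };

    audioInput = input;
    audioInput.StartRecording();
}

private static void StopRecording()
{
    audioInput?.StopRecording();
    // RecordingStopped fires asynchronously, so wait until the file is closed.
    // Safe in a console app; a UI thread should not block like this.
    recordingFinished.Wait();
}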

Speech Recognition Engine

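using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Whisper.net; // the namespace is spelled with a lowercase "n"

public class AudioRecognizer : IDisposable
{
    private readonly WhisperFactory factory;
    private readonly WhisperProcessor processor;

    public AudioRecognizer(string modelPath)
    {
        // Load the model from disk and build a processor fixed to Japanese.
        factory = WhisperFactory.FromPath(modelPath);
        processor = factory.CreateBuilder()
            .WithLanguage("ja")
            // Initial prompt, in Japanese: "This is a Japanese speech recognition sample."
            .WithPrompt("日本語の音声認識サンプルです。")
            .Build();
    }

    // Transcribes a WAV file and returns one "hh:mm:ss - text" line per segment.
    public async Task<string> ConvertToText(string filePath)
    {
        var output = new StringBuilder();
        using var stream = File.OpenRead(filePath);

        await foreach (var segment in processor.ProcessAsync(stream))
        {
            output.AppendLine($"{segment.Start:hh\\:mm\\:ss} - {segment.Text}");
        }

        return output.ToString();
    }

    // Release the processor first, then the factory that created it.
    public void Dispose()
    {
        processor?.Dispose();
        factory?.Dispose();
    }
}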

Processing Input Data

The Whisper model requires 16 kHz WAV input. The following example converts the sample rate with NAudio:

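// Requires: using NAudio.Wave; using NAudio.Wave.SampleProviders;
public void ConvertAudio(string inputPath, string outputPath)
{
    // AudioFileReader decodes to 32-bit float samples, which the ACM-based
    // WaveFormatConversionStream generally cannot convert to 16 kHz PCM directly,
    // so this version resamples with WdlResamplingSampleProvider instead.
    using var reader = new AudioFileReader(inputPath);

    var mono = reader.ToMono();                                   // expects mono or stereo input
    var resampled = new WdlResamplingSampleProvider(mono, 16000); // target: 16 kHz

    // Writes the float samples out as 16-bit PCM.
    WaveFileWriter.CreateWaveFile16(outputPath, resampled);
}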
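
Before handing a file to the recognizer, it can help to confirm that it really has the expected format. The sketch below reads the RIFF header directly with the base class library, so it has no NAudio dependency. WavInspector and MatchesCaptureFormat are hypothetical names, and the check assumes the 16 kHz, mono, 16-bit format that the capture code in this article records; the article itself only states the 16 kHz requirement.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: inspects the "fmt " chunk of a WAV stream.
public static class WavInspector
{
    // Returns the format fields, or null if the stream is not a readable WAV.
    public static (int SampleRate, int Channels, int BitsPerSample)? ReadFormat(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true);

        if (stream.Length - stream.Position < 12) return null;
        var riff = Encoding.ASCII.GetString(reader.ReadBytes(4));
        reader.ReadUInt32(); // overall RIFF size, not needed here
        var wave = Encoding.ASCII.GetString(reader.ReadBytes(4));
        if (riff != "RIFF" || wave != "WAVE") return null;

        // Walk the chunk list until the "fmt " chunk appears.
        while (stream.Position + 8 <= stream.Length)
        {
            var chunkId = Encoding.ASCII.GetString(reader.ReadBytes(4));
            var chunkSize = reader.ReadUInt32();

            if (chunkId == "fmt ")
            {
                if (chunkSize < 16 || stream.Position + 16 > stream.Length) return null;
                reader.ReadUInt16();                       // audio format tag
                int channels = reader.ReadUInt16();
                int sampleRate = (int)reader.ReadUInt32();
                reader.ReadUInt32();                       // byte rate
                reader.ReadUInt16();                       // block align
                int bits = reader.ReadUInt16();
                return (sampleRate, channels, bits);
            }

            // Chunks are word-aligned: an odd size is followed by one pad byte.
            stream.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
        }

        return null;
    }

    // Assumption: "ready" means the same format the capture code records.
    public static bool MatchesCaptureFormat(Stream stream)
    {
        var format = ReadFormat(stream);
        return format.HasValue
            && format.Value.SampleRate == 16000
            && format.Value.Channels == 1
            && format.Value.BitsPerSample == 16;
    }
}
```

A caller could open the file with File.OpenRead, run MatchesCaptureFormat, and fall back to ConvertAudio only when the check fails.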

Implementation Notes

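The Tiny model (74MB) is fast, but its accuracy is limited.
For long recordings, splitting the audio and processing it in segments is recommended.
Accuracy on mixed-language audio varies with model size.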
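
As one way to approach the splitting mentioned above, the sketch below only plans the time windows; cutting the audio itself is left out. ChunkPlanner is a hypothetical helper, and the chunk length and overlap are parameters the caller chooses, since the article does not prescribe any values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: plans overlapping time windows over a long recording.
public static class ChunkPlanner
{
    // Yields (start, end) windows covering [0, total). Each window is at most
    // chunkLength long, and consecutive windows overlap by `overlap`.
    public static IEnumerable<(TimeSpan Start, TimeSpan End)> Plan(
        TimeSpan total, TimeSpan chunkLength, TimeSpan overlap)
    {
        if (chunkLength <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(chunkLength));
        if (overlap < TimeSpan.Zero || overlap >= chunkLength)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var step = chunkLength - overlap;
        for (var start = TimeSpan.Zero; start < total; start += step)
        {
            var end = start + chunkLength;
            if (end >= total)
            {
                // Last window: clamp to the end of the recording and stop.
                yield return (start, total);
                yield break;
            }
            yield return (start, end);
        }
    }
}
```

Each planned window could then be cut out of the source file and passed to ConvertToText in turn. A small overlap reduces the chance of losing a word at a boundary, at the cost of some duplicated text that the caller has to reconcile.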

Tags: csharp WhisperNet NAudio speech-recognition AI-models

Posted June 27, 17:33

異端開発室