WPE speech dereverberation

Default configuration

The WPE package ships with a default configuration file, settings/local.m, set up for the 8-channel case, as shown below.

%%
%% ======================================================================
%%
%% Sample configurations file for WPE
%%
%% Copyright (c) 2015 Nippon Telegraph and Telephone corporation (NTT).
%% All rights reserved.
%% By Takuya Yoshioka, Marc Delcroix 24-06-2015.
%% ======================================================================

% Basic parameters
%----------------------------------------------------------------------

fs    = 16000;     %% Sampling frequency

num_mic = 8;       %% Number of channels

num_out = num_mic; %% Number of outputs (should be <= microphones).
		   %% set to 1 if output a single channel

blk_len = 100;      %% Block length (in sec). Set to a large value for
                   %% utterance-batch processing
		   
opt_blk_sz = 1;    %% Optimize the block size for block batch processing
                   %%  to speedup computations

% Signal analysis parameters
%----------------------------------------------------------------------

analy_param = struct('win_size'  , 512, ...
		     'shift_size', 128, ...
		     'win'       , hanning(512));

% Dereverberation parameters 
%----------------------------------------------------------------------

%% Parameters of prediction filter
%% [number of filter coefficients; prediction delay; upper frequency]
%% It is possible to set different filter parameters for different
%% frequency bands, e.g. a filter length of 10 up to 500 Hz and 6 for the rest:
%% channel_setup = [10, 6;
%%                 3, 3;
%%                 500, inf]
channel_setup = [10; ...
		 3; ...
		 inf];

%% Dereverberation filter configuration
%% 'channel_setup' consists of the prediction filter settings set above
%% 'p_channel'     sets the index of the target channel for prediction
%% 'speech_order'  sets the speech lpc order
ssd_param = struct('channel_setup', channel_setup, ...
		   'p_channel'    , [1 : num_out], ...
		   'speech_order' , 20);

%% Optimization parameters
%% 'max_iter'  number of iterations
%% 'spcorr'    structure of the speech covariance matrix ('scaleye'
%%             corresponds to diagonal)
%% 'scaling'   gain between input and output
%% 'forget'    forgetting factor for the correlation matrix (the larger
%%             the more the past observations are remembered)
ssd_conf = struct('max_iter', 3, ...
		  'spcorr'  , 'scaleye', ...
		  'scaling' , 1, ...
		  'forget'  , 0.7);

%% Enhancement configuration
%% 'method'   sets the method used to suppress late reverberation (either
%%            linear filtering 'lti' or spectral subtraction 'ss')
%% 'osub'     oversubtraction (for 'ss' only)
%% 'scal'     scaling (for 'ss' only)
%% 'floor'    flooring (for 'ss' only)
enh_conf = struct('method', 'lti', ...
		  'osub'  , 1.0, ...
		  'scal'  , 2, ...
		  'floor' , -80);

Changing the configuration

This section reviews the main settings that may need to be changed when using WPE for different tasks.
First, when using WPE to process a new corpus, you may need to change the file settings/arrayname.lst, which sets up the naming convention used to read multi-channel input data.

In addition, the principal options that may need to be changed are:

- The sampling frequency at which dereverberation is performed (fs). If the sampling frequency of a recording does not match fs, the recording is automatically resampled to fs.
- The analysis window size and shift ('win_size' and 'shift_size' in analy_param). Both are expressed in taps (samples), so they should be adjusted whenever the sampling frequency is changed.
- The number of microphones (num_mic). Set num_mic to the number of microphones available. By default, the number of output channels (num_out) matches the number of input channels.
- The processing block length (blk_len). Set blk_len to a large value for utterance-batch processing. For long audio files, you can use block-batch processing instead by setting blk_len to a value shorter than the signal length.
- The prediction filter length (the first element of channel_setup). A useful rule of thumb for the prediction filter configuration is L ~ (M-1)/(P-1), where L is the sum of the prediction filter length and the prediction delay, M is the reverberation time, and P is the number of microphones; L and M are both counted in STFT frames. In the default configuration, P = 8 and L = 10 + 3 = 13, which gives M < 92 frames, or 760 msec with the 32-msec window and 8-msec shift ((92 - 1) x 8 + 32 = 760). This guideline is derived from the MINT theorem. In practice, for automatic speech recognition it is better to use a prediction filter slightly shorter than (M-1)/(P-1), so that the filter does not distort the target signal or amplify background noise.
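
The window scaling and the rule-of-thumb arithmetic above can be checked with a short MATLAB sketch. It is not part of the WPE package: win_ms, shift_ms, L, M_frames and M_ms are illustrative names chosen here, and the 32-msec/8-msec durations are simply those implied by the default configuration at 16 kHz.

```matlab
% Sketch only: apart from fs, num_mic, channel_setup, win_size and
% shift_size, the variable names below are illustrative assumptions.

fs      = 16000;   % sampling frequency in Hz (default configuration)
num_mic = 8;       % P in the rule of thumb

% Keep the frame durations fixed and derive the sizes in samples (taps).
win_ms   = 32;     % window length in msec (512 samples at 16 kHz)
shift_ms = 8;      % frame shift in msec   (128 samples at 16 kHz)
win_size   = round(win_ms   * fs / 1000);
shift_size = round(shift_ms * fs / 1000);
% These values would then go into analy_param, e.g.
% struct('win_size', win_size, 'shift_size', shift_size, 'win', hanning(win_size))

% Prediction filter: [number of coefficients; prediction delay; upper frequency]
channel_setup = [10; 3; inf];
L = channel_setup(1) + channel_setup(2);   % filter length + delay, in frames

% Rule of thumb L ~ (M-1)/(P-1), solved for M (in STFT frames).
M_frames = L * (num_mic - 1) + 1;

% Time spanned by M_frames frames: (M-1) shifts plus one full window.
M_ms = (M_frames - 1) * shift_ms + win_ms;

fprintf('win_size = %d, shift_size = %d samples\n', win_size, shift_size);
fprintf('M = %d frames (about %d msec)\n', M_frames, M_ms);
```

With fs = 8000 the same durations give win_size = 256 and shift_size = 64, while M_frames and M_ms stay the same because they depend only on the frame durations, the filter settings and the number of microphones.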

Example configurations

In our experiments, we used the following configurations.

Configuration for REVERB challenge task

Marc Delcroix, Takuya Yoshioka, Atsunori Ogawa, Yotaro Kubo, Masakiyo Fujimoto, Ito Nobutaka, Keisuke Kinoshita, Miquel Espi, Takaaki Hori, Tomohiro Nakatani, and Atsushi Nakamura, "Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge," in Proceedings of the 2014 REVERB Workshop, May 2014.
Marc Delcroix, Takuya Yoshioka, Atsunori Ogawa, Yotaro Kubo, Masakiyo Fujimoto, Ito Nobutaka, Keisuke Kinoshita, Miquel Espi, Takaaki Hori, Tomohiro Nakatani, "Strategies for distant speech recognition in reverberant environments," EURASIP Journal on Advances in Signal Processing, 2015.

8-channel case

% Basic parameters
%----------------------------------------------------------------------

fs    = 16000;     %% Sampling frequency

num_mic = 8;       %% Number of channels

num_out = num_mic; %% Number of outputs (should be <= microphones).
		   %% set to 1 if output a single channel

blk_len = 30;      %% Block length (in sec). Set to a large value for
                   %% utterance-batch processing
		   
opt_blk_sz = 1;    %% Optimize the block size for block batch processing
                   %%  to speedup computations

% Signal analysis parameters
%----------------------------------------------------------------------

analy_param = struct('win_size'  , 512, ...
		     'shift_size', 128, ...
		     'win'       , hanning(512));

% Dereverberation parameters 
%----------------------------------------------------------------------

Single-channel case

% Basic parameters
%----------------------------------------------------------------------

fs    = 16000;     %% Sampling frequency

num_mic = 1;       %% Number of channels

num_out = num_mic; %% Number of outputs (should be <= microphones).
		   %% set to 1 if output a single channel

% Signal analysis parameters
%----------------------------------------------------------------------

analy_param = struct('win_size'  , 512, ...
		     'shift_size', 128, ...
		     'win'       , hanning(512));

% Dereverberation parameters 
%----------------------------------------------------------------------

Configuration for CHiME3 challenge task (6ch)

Takuya Yoshioka, Ito Nobutaka, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech Fabian, Miquel Espi, Takuya Higuchi, Shoko Araki, and Tomohiro Nakatani, "The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.

% Basic parameters
%----------------------------------------------------------------------

fs    = 16000;     %% Sampling frequency

num_mic = 6;       %% Number of channels

num_out = num_mic; %% Number of outputs (should be <= microphones).
		   %% set to 1 if output a single channel

% Signal analysis parameters
%----------------------------------------------------------------------

analy_param = struct('win_size'  , 512, ...
		     'shift_size', 128, ...
		     'win'       , hanning(512));

% Dereverberation parameters 
%----------------------------------------------------------------------