WPE speech dereverberation

Default configuration

The WPE package ships with a default configuration file, settings/local.m, which is set up for the 8-channel case.

Changing the configuration

Here we review the main settings that may need to be changed when using WPE for different tasks.
First, when using WPE to process a new corpus, you may need to change the file settings/arrayname.lst, which sets up the naming convention used to read the multi-channel input data.

In addition, the principal options that may need to be changed are:
  • The sampling frequency at which dereverberation is performed (fs). Recordings are automatically down- or up-sampled to fs if their sampling frequency does not match it.
  • The analysis window size and shift ('win_size' and 'shift_size' in analy_param).
    'win_size' and 'shift_size' should be modified if the sampling frequency is changed.
    Note that both values are expressed in taps (samples), not in seconds.
  • The number of microphones (num_mic).
    num_mic should be set to the number of microphones available. By default, the number of output channels (num_out) matches the number of input channels.
  • The processing block length (blk_len).
    To process each utterance as a single batch (utterance batch mode), set blk_len to a large value. To process long audio files, you can use block batch processing instead by setting blk_len to a value shorter than the signal length.
  • The prediction filter length (the first element of channel_setup).
    A useful rule of thumb for the prediction filter configuration is L ~ (M-1)/(P-1), where L is the sum of the prediction filter length and the prediction delay, M is the reverberation time, and P is the number of microphones. In this example, P = 8 and L = 10 + 3 = 13, which corresponds to M < 92 taps, or 760 msec given the 32-msec window with the 8-msec stride. This guideline is derived from the MINT theorem. In practice, for automatic speech recognition it is better to use a prediction filter slightly shorter than (M-1)/(P-1), so that the prediction filter does not distort the target signal or amplify background noise.
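The arithmetic behind the window settings and the rule of thumb above can be sketched in a few lines of Python. This is only an illustrative calculation, not part of the WPE package: the function names are hypothetical, the 16 kHz rate in the usage note is an assumed example, and the conversion from taps to milliseconds, (M-1) x shift + window, is an assumption chosen because it reproduces the 92-tap / 760-msec figures quoted above.

```python
import math


def window_params(fs_hz, win_ms=32.0, shift_ms=8.0):
    """Return (win_size, shift_size) in samples for a sampling rate in Hz."""
    win_size = int(round(fs_hz * win_ms / 1000.0))
    shift_size = int(round(fs_hz * shift_ms / 1000.0))
    return win_size, shift_size


def covered_reverb(filter_len, delay, num_mic, win_ms=32.0, shift_ms=8.0):
    """Invert L ~ (M-1)/(P-1): return (M in taps, equivalent duration in ms)."""
    if num_mic < 2:
        # (P-1) would be zero, so the rule gives no answer for one microphone.
        raise ValueError("the rule of thumb needs at least two microphones")
    total_len = filter_len + delay            # L = filter length + delay
    taps = total_len * (num_mic - 1) + 1      # M = L * (P-1) + 1
    # Assumed conversion to time: (M-1) shifts plus one full window.
    duration_ms = (taps - 1) * shift_ms + win_ms
    return taps, duration_ms


def suggest_filter_len(reverb_ms, delay, num_mic, win_ms=32.0, shift_ms=8.0):
    """Upper bound on the prediction filter length for a reverberation time."""
    if num_mic < 2:
        raise ValueError("the rule of thumb needs at least two microphones")
    taps = (reverb_ms - win_ms) / shift_ms + 1.0     # M, same assumed conversion
    total_len = (taps - 1.0) / (num_mic - 1)         # L ~ (M-1)/(P-1)
    # Subtract the prediction delay; keep at least one filter tap.
    return max(1, math.floor(total_len) - delay)


if __name__ == "__main__":
    print(window_params(16000))             # window and shift in samples
    print(covered_reverb(10, 3, 8))         # default 8-channel example
    print(suggest_filter_len(760.0, 3, 8))  # upper bound for 760 msec
```

With the default 8-channel values (filter length 10, delay 3), covered_reverb returns 92 taps and 760.0 msec, matching the example above. Since a slightly shorter filter is recommended for speech recognition, the value from suggest_filter_len is best read as an upper bound rather than a target.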

Example configurations

In our experiments, we used the following configurations.

Configuration for the REVERB challenge task

  • Marc Delcroix, Takuya Yoshioka, Atsunori Ogawa, Yotaro Kubo, Masakiyo Fujimoto, Ito Nobutaka, Keisuke Kinoshita, Miquel Espi, Takaaki Hori, Tomohiro Nakatani, and Atsushi Nakamura, "Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge," in Proceedings of the 2014 REVERB Workshop, May 2014.
  • Marc Delcroix, Takuya Yoshioka, Atsunori Ogawa, Yotaro Kubo, Masakiyo Fujimoto, Ito Nobutaka, Keisuke Kinoshita, Miquel Espi, Takaaki Hori, and Tomohiro Nakatani, "Strategies for distant speech recognition in reverberant environments," EURASIP Journal on Advances in Signal Processing, 2015.

8-channel case

Single-channel case

Configuration for the CHiME-3 challenge task (6ch)

  • Takuya Yoshioka, Ito Nobutaka, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech Fabian, Miquel Espi, Takuya Higuchi, Shoko Araki, and Tomohiro Nakatani, "The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.