NeurIPS 2021

The Tufts fNIRS Mental Workload Dataset & Benchmark for Brain-Computer Interfaces that Generalize

Zhe Huang, Liang Wang, Giles Blaney, Christopher Slaughter, Devon McKeon, Ziyu Zhou, Robert J. K. Jacob, Michael C. Hughes

Abstract

Functional near-infrared spectroscopy (fNIRS) promises a non-intrusive way to measure real-time brain activity and build responsive brain-computer interfaces. A primary barrier to realizing this technology's potential has been that observed fNIRS signals vary significantly across human users. Building models that generalize well to never-before-seen users has been difficult; a large amount of subject-specific data has been needed to train effective models. To help overcome this barrier, we introduce the largest open-access dataset of its kind, containing multivariate fNIRS recordings from 68 participants, each with labeled segments indicating four possible mental workload intensity levels. Labels were collected via a controlled setting in which subjects performed standard n-back tasks to induce desired working memory levels. We propose a benchmark analysis of this dataset with a standardized training and evaluation protocol, which allows future researchers to report comparable numbers and fairly assess generalization potential while avoiding any overlap or leakage between train and test data. Using this dataset and benchmark, we show how models trained using abundant fNIRS data from many other participants can effectively classify a new target subject's data, thus reducing calibration and setup time for new subjects. We further show how performance improves as the size of the available dataset grows, while also analyzing error rates across key subpopulations to audit equity concerns. We share our open-access Tufts fNIRS to Mental Workload (fNIRS2MW) dataset (https://tufts-hci-lab.github.io/code_and_datasets/fNIRS2MW.html, License: CC-BY-4.0) and open-source code (https://github.com/tufts-ml/fNIRS-mental-workload-classifiers, License: MIT) as a step toward advancing brain-computer interfaces.

35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks.

input passively, without conscious effort from the user.
The fNIRS input drives subtle changes in the user interface or task, tailored to the user's moment-to-moment measured mental state. We have demonstrated successful prototype systems (Solovey et al., 2015; Bosworth et al., 2019). In this work, we consider the problem of developing an effective classifier of mental workload intensity given a short window of fNIRS time-series data. While several efforts have pursued this task before for fNIRS (Coffey et al., 2012; Herff et al., 2014; Aghajani et al., 2017) as well as other BCI sensing technologies (Yin and Zhang, 2017; Saadati et al., 2020), building a classifier that generalizes to new users remains difficult. Addressing this generalization problem would reduce the effort required to set up a new subject, which would be valuable in a practical BCI setting. Three barriers stand in the way of effective generalization: a lack of large open-access datasets, a lack of standardized protocols following best practices for evaluation, and high variability across subjects.

One common barrier to effective fNIRS-based BCIs is the lack of available data. Previous work typically collects proprietary datasets from only 10-30 subjects. Even in a paid research study, collecting more than an hour of data per user can be difficult because the sensor can eventually become uncomfortable. Most studies do not share data due to the logistical difficulties of human subjects research. While a few open-access datasets for fNIRS data exist (see Table 1), none has more than 30 subjects. Furthermore, the demographic composition of existing data is not always accessible and, if reported, is often homogeneous, complicating the goal of building BCIs for a diverse range of people. Using homogeneous existing datasets to train models is particularly concerning given that NIRS sensor performance may be sensitive to the user's skin color (Wassenaar and Van den Brand, 2005; Couch et al., 2015) as well as dark hair (Chen et al., 2020).
This might lead to poor performance for some users, reminiscent of racial disparities observed in face recognition (Buolamwini and Gebru, 2018) and disease detection from pulse oximetry (Sjoding et al., 2020). Improving the diversity and auditability (Raji et al., 2020) of open data is crucial to achieving BCI that works for many people.

Another barrier to progress is the lack of a standardized evaluation protocol. Without one, different papers may not follow the same experimental design, making results incomparable and impeding scientific progress. While much is known about best practices for hyperparameter selection and held-out performance estimation in the time series context (Racine, 2000; Mozetič et al., 2018; Cerqueira et al., 2020), without a standard protocol later work may not follow these practices. For example, when evaluating models meant to generalize across subjects, performance should only be reported using never-before-seen subjects. To evaluate a model trained o