Augmenting hearing capabilities of the deaf with Natural Language Processing
Jacky Zhao
St George's School
Floor Location : S 051 N

Over 360 million people in the world have disabling hearing loss. One of the main impacts of hearing loss is on an individual's ability to communicate with others. This can have adverse effects such as emotional distress, missed opportunities for employment or education, and large productivity losses. According to the WHO, “Loss of productivity, due to unemployment and premature retirement among people with hearing loss, is conservatively estimated to cost $678 billion annually.” This is clearly a major problem in today's world. Most people with hearing loss communicate through lip reading, sign language, and other visual cues; however, these solutions have inherent problems. Lip reading is very difficult, often takes many years to learn, and even then depends heavily on visual cues that are absent from many day-to-day interactions. Sign language has similar problems: it also depends on visual cues and takes a long time to learn. It introduces further problems of its own, such as sign language 'accents' or 'dialects', and a small slip of the hand can completely change the meaning of a sentence. Hearing aids cannot solve everything either: people who are completely deaf do not benefit from amplified sound, and the devices carry hefty price tags, often in the thousands of dollars. Cochlear implants require surgery that carries medical risk and can have adverse effects for older people. These limitations put people with hearing loss at a disadvantage and can create dangerous situations in which they are unaware of alarms or other auditory cues that are important to their safety.

This project entailed creating a neural network that converts streams of audio data into readable braille output. The device is made from six 5 V solenoids connected to a Raspberry Pi, which is wired to a microphone. The algorithm, implemented in TensorFlow, achieved a final accuracy of 74.66% on the training subset of the TIMIT corpus and 42.18% on the TIMIT test subset with 28 character classes, at which point it had reached the limits of the network structure and the hardware available. I used a deep Long Short-Term Memory (LSTM) network with 2 layers and 128 hidden cells per layer, optimized with RMSProp using a learning rate of 1e-4 and a decay of 0.9. To improve the generalization of the network, I used batch normalization and added Gaussian noise with a standard deviation of 0.15 to the inputs. Inputs are preprocessed by computing 13 Mel-frequency cepstral coefficients (MFCCs) and their first and second derivatives, with a window length of 25 ms and a stride of 10 ms. The whole network was trained for 5 days on a laptop with an NVIDIA GTX 970M graphics card.
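To illustrate the output stage, the following is a minimal sketch of how a predicted character could be turned into on/off states for the six solenoids. It is not the project's actual code. The names BRAILLE_DOTS and solenoid_states are hypothetical, and the 28 character classes are assumed here to be the letters a to z plus space and apostrophe, which the abstract does not specify. Driving the Raspberry Pi pins is left out because the wiring is not described.

```python
# Hypothetical sketch: map one predicted character to six solenoid states.
# Dots use standard braille cell numbering: dots 1-3 run down the left
# column and dots 4-6 run down the right column.
# ASSUMPTION: the 28 classes are a-z, space, and apostrophe.
BRAILLE_DOTS = {
    "a": (1,),            "b": (1, 2),          "c": (1, 4),
    "d": (1, 4, 5),       "e": (1, 5),          "f": (1, 2, 4),
    "g": (1, 2, 4, 5),    "h": (1, 2, 5),       "i": (2, 4),
    "j": (2, 4, 5),       "k": (1, 3),          "l": (1, 2, 3),
    "m": (1, 3, 4),       "n": (1, 3, 4, 5),    "o": (1, 3, 5),
    "p": (1, 2, 3, 4),    "q": (1, 2, 3, 4, 5), "r": (1, 2, 3, 5),
    "s": (2, 3, 4),       "t": (2, 3, 4, 5),    "u": (1, 3, 6),
    "v": (1, 2, 3, 6),    "w": (2, 4, 5, 6),    "x": (1, 3, 4, 6),
    "y": (1, 3, 4, 5, 6), "z": (1, 3, 5, 6),
    " ": (),              # blank cell: no dots raised
    "'": (3,),            # apostrophe
}


def solenoid_states(char):
    """Return six booleans for dots 1 to 6; True means that solenoid is raised."""
    dots = BRAILLE_DOTS[char.lower()]
    return tuple(dot in dots for dot in range(1, 7))


if __name__ == "__main__":
    # Show the cell pattern for each character of a short example word.
    for ch in "hi":
        print(ch, solenoid_states(ch))
```

On the real device, each boolean would be written to the output pin that switches the corresponding solenoid, with a short pause between characters so the reader can feel each cell.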