Understanding visually grounded spoken language via multi-tasking

An alternative approach for intelligent systems to understand human speech

Understanding spoken language is an important capability of intelligent systems that interact with people. Applications that rely on a speech understanding component include personal assistants and search engines. The common way to enable an application to understand and react to spoken language is to first transcribe speech into text with a speech recognition module and then process the text with a separate text understanding module.

We propose an alternative approach inspired by how humans understand speech. An end-to-end neural network model will process speech directly, without first transcribing it into text, which avoids the need for the large amounts of transcribed speech required to train a traditional speech recognition system. The system will instead learn simultaneously from types of data that are easier to obtain: for example, it will learn to match images to their spoken descriptions, answer questions about images, or match utterances spoken in different languages.

Because it relies less on strong supervision and expensive resources, our proposal promises to be applicable in a wider range of circumstances than traditional systems, especially where large amounts of transcribed speech are not available, such as low-resource languages or specialized domains.

Participating organisations

Netherlands eScience Center
Tilburg University
Social Sciences & Humanities

Team

Christiaan Meijer
eScience Research Engineer
Netherlands eScience Center
Grzegorz Chrupala
Principal investigator
Tilburg University
Jisk Attema
Senior eScience Research Engineer
Netherlands eScience Center
Patrick Bos
eScience Research Engineer
Netherlands eScience Center

Related projects

DIANNA - Deep Insight and Neural Networks Analysis

Explainable AI tool for scientists

Updated 2 months ago
In progress

ePODIUM

Early prediction of dyslexia in infants using machine learning

Updated 17 months ago
Finished

NEWSGAC

Advancing media history by transparent automatic genre classification

Updated 18 months ago
Finished

TICCLAT

Text-induced corpus correction and lexical assessment tool

Updated 18 months ago
Finished

Emotion Recognition in Dementia

Advancing technology for multimodal analysis of emotion expression in everyday life

Updated 21 months ago
Finished

Case Law Analytics

Discovering new patterns in Dutch court decisions

Updated 21 months ago
Finished

What Works When for Whom?

Advancing therapy change process research

Updated 21 months ago
Finished

Dr. Watson

Medical experts helping machines diagnose

Updated 18 months ago
Finished

Related software

DIANNA

Deep Insight And Neural Network Analysis (DIANNA) is the only Explainable AI (XAI) library for scientists supporting Open Neural Network Exchange (ONNX), the de facto standard model format.

Updated 2 months ago

Platalea

Platalea contains deep neural network architectures for modeling spoken language from multiple sources at once: audio together with images, text or video. These models can be used to study what different context sources contribute to the language acquisition process.

Updated 26 months ago