
Understanding visually grounded spoken language via multi-tasking

An alternative approach for intelligent systems to understand human speech

Understanding spoken language is an important capability of intelligent systems that interact with people. Applications with a speech understanding component include personal assistants and search engines, among others. The common way of enabling an application to understand and react to spoken language is to first transcribe speech into text using a speech recognition module, and then to process the text with a separate text understanding module.

We propose an alternative approach inspired by how humans understand speech. Speech is processed directly by an end-to-end neural network model without first being transcribed into text, avoiding the large amounts of transcribed speech needed to train a traditional speech recognition system. The system instead learns simultaneously from more easily obtained types of data: for example, it learns to match images to their spoken descriptions, answer questions about images, or match utterances spoken in different languages.

Our proposal promises to be less reliant on strong supervision and expensive resources and thus applicable in a wider range of circumstances than traditional systems, especially when large amounts of transcribed speech are not available, for example when dealing with low-resource languages or specialized domains.
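The matching objective described above can be sketched in a few lines. The toy example below is hypothetical and not the project's actual code: it uses random features and linear "encoders" in NumPy, whereas the real models (e.g. in Platalea) are deep neural networks trained on real audio and images. It shows the core idea of embedding spoken utterances and images in a shared space and scoring each utterance against every image in a batch with a contrastive (InfoNCE-style) loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, w):
    """Toy 'encoder': a linear projection followed by L2 normalisation."""
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical dimensions: a batch of 4 utterance/image pairs.
batch, audio_dim, image_dim, shared_dim = 4, 16, 12, 8
audio_feats = rng.standard_normal((batch, audio_dim))   # stand-in for acoustic features
image_feats = rng.standard_normal((batch, image_dim))   # stand-in for visual features
w_audio = rng.standard_normal((audio_dim, shared_dim))
w_image = rng.standard_normal((image_dim, shared_dim))

a = embed(audio_feats, w_audio)   # utterance embeddings in the shared space
v = embed(image_feats, w_image)   # image embeddings in the shared space

# Cosine similarity between every utterance and every image;
# the diagonal holds the matching pairs.
sim = a @ v.T

# Contrastive loss: each utterance should be more similar to its own
# image than to the other images in the batch.
logits = sim / 0.07  # temperature scaling
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(float(loss))
```

Training would adjust the encoder parameters to drive this loss down, so that matching utterance–image pairs end up close together in the shared space; no transcriptions are required at any point.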

Participating organisations

Netherlands eScience Center
Tilburg University

Team

Christiaan Meijer
eScience Research Engineer
Netherlands eScience Center
Grzegorz Chrupala
Principal investigator
Tilburg University
Jisk Attema
Senior eScience Research Engineer
Netherlands eScience Center
Patrick Bos
eScience Research Engineer
Netherlands eScience Center

Related projects

NEWSGAC

Advancing media history by transparent automatic genre classification

Finished

TICCLAT

Text-induced corpus correction and lexical assessment tool

Finished

Emotion Recognition in Dementia

Advancing technology for multimodal analysis of emotion expression in everyday life

Finished

What Works When for Whom?

Advancing therapy change process research

Finished

Related tools

DIANNA


Deep Insight And Neural Network Analysis (DIANNA) is the only Explainable AI (XAI) library for scientists supporting the Open Neural Network Exchange (ONNX), the de facto standard model format.


Platalea


Platalea contains deep neural network architectures for modeling spoken language from multiple sources at once: audio together with images, text or video. These can be used to study what different context sources add to the language acquisition process.
