Source code and data underlying the publication "Context-Informed Machine Translation of Manga using Multimodal Large Language Models"

5
contributors

Description

This dataset was created for experimental research in multimodal machine translation, specifically to study context-informed translation of manga with large language models. It consists of professionally produced Japanese-to-Polish translations of the slice-of-life manga Love Hina. Volumes 1 and 14 are annotated so that the annotations closely follow the existing Japanese text annotations. The data were collected by taking the original manga page images and having translators produce aligned Polish translations for all textual elements (speech bubbles, sound effects, etc.). Both the original and the translated text are encoded together with their bounding-box coordinates on each page. In total, the dataset contains 400 page images and 3,705 individual text lines across the two volumes, provided as a paired image–text corpus with structured metadata suitable for evaluating long-context, multimodal translation methods.
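As a rough illustration of the kind of record the description implies (source text, aligned translation, and a bounding box per text line), the following minimal Python sketch may help. The class `TextLine`, its field names, the bounding-box convention, and the helper functions are all assumptions made for illustration; they are not the dataset's actual schema, and the sample strings are invented placeholders rather than dataset content.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TextLine:
    """One annotated text line. All field names are hypothetical."""
    volume: int
    page: int
    bbox: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels
    source_ja: str  # original Japanese text
    target_pl: str  # aligned Polish translation


def group_by_page(lines: List[TextLine]) -> Dict[Tuple[int, int], List[TextLine]]:
    """Group text lines by (volume, page), preserving their input order."""
    pages = defaultdict(list)
    for line in lines:
        pages[(line.volume, line.page)].append(line)
    return dict(pages)


def preceding_context(lines: List[TextLine], index: int, window: int = 2) -> List[str]:
    """Return the source text of up to `window` lines before `index`."""
    start = max(0, index - window)
    return [line.source_ja for line in lines[start:index]]


# Invented placeholder records, not taken from the dataset.
sample = [
    TextLine(1, 1, (10, 20, 110, 80), "こんにちは", "Cześć"),
    TextLine(1, 1, (130, 20, 230, 80), "ありがとう", "Dziękuję"),
    TextLine(1, 2, (15, 30, 115, 90), "さようなら", "Do widzenia"),
]

pages = group_by_page(sample)
context = preceding_context(sample, index=2)

# Average density implied by the reported totals (3,705 lines, 400 pages).
lines_per_page = round(3705 / 400, 2)
```

Grouping by page and collecting the preceding lines is one simple way to build the textual context that a context-informed translation method could consume. The files in the repository should be consulted for the real field names and formats.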

Programming languages
  • Jupyter Notebook 45%
  • JSON 27%
  • XML 23%
  • Python 4%
  • Other 1%
License
  • MIT
Source code
Packages
data.4tu.nl

Contributors

Joshua Tanner
Konrad Skublicki
Shonosuke Ishiwatari

Member of community

4TU