DOI: 10.5555/3523760.3523884

"Cool glasses, where did you get them?": Generating Visually Grounded Conversation Starters for Human-Robot Dialogue

Published: 07 March 2022

Abstract

Visually situated language interaction is an important challenge in multi-modal Human-Robot Interaction (HRI). In this context, we present a data-driven method to generate situated conversation starters based on visual context: we take visual data about the interactants and generate appropriate greetings for conversational agents. For this, we constructed a novel open-source data set consisting of 4000 HRI-oriented images of people facing the camera, each annotated with three conversation-starting questions. We compared a baseline retrieval-based model and a generative model. Human evaluation of the models using crowdsourcing shows that the generative model scores best, specifically at correctly referencing visual features. We also investigated how automated metrics can be used as a proxy for human evaluation and found that common automated metrics are a poor substitute for human judgement. Finally, we provide a proof-of-concept demonstrator through an interaction with a Furhat social robot.
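The sketch below illustrates, in broad strokes, how a BART-style generative model could map a textual description of an interactant's visual features to a conversation-starting question. It is a minimal sketch only: the checkpoint name, the serialised "visual context" format, and the decoding settings are assumptions for illustration, not the authors' pipeline, and a base (non-fine-tuned) checkpoint would first need fine-tuning on the paper's dataset before producing sensible starters.

    # Illustrative sketch; not the authors' implementation.
    from transformers import BartForConditionalGeneration, BartTokenizer

    # Hypothetical checkpoint: the paper uses a BART-based generative model
    # trained on its own 4000-image dataset; "facebook/bart-base" is a stand-in.
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # Visual features of the interactant, serialised as text (assumed format).
    visual_context = "person facing the camera, wearing glasses and a red scarf"

    inputs = tokenizer(visual_context, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_length=32,
        num_beams=4,
        early_stopping=True,
    )
    starter = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(starter)  # after fine-tuning, e.g. "Cool glasses, where did you get them?"

A retrieval-based baseline would instead select the closest question from the training set, whereas the generative model can produce novel questions; as the abstract notes, judging whether such outputs correctly reference visual features required human evaluation, since automated metrics proved a poor proxy.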

Supplemental Material

MP4 File
Supplemental video




Published In

HRI '22: Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction
March 2022
1353 pages


Publisher

IEEE Press


Author Tags

  1. conversational agent
  2. grounding
  3. human-robot interaction
  4. multi-modal dialogue
  5. natural language generation
  6. natural language processing
  7. situatedness

Qualifiers

  • Research-article


Funding Sources

  • Flemish Research Foundation
  • Flemish Government

Conference

HRI '22

Acceptance Rates

Overall Acceptance Rate 268 of 1,124 submissions, 24%

