
Multi-robot cooperative behavior for reducing unnaturalness of starting a conversation

Pages 465-481 | Received 11 Aug 2023, Accepted 15 Nov 2023, Published online: 26 Dec 2023

Abstract

In a human–robot conversation, it is difficult for the robot to start the conversation exactly when the addressee is ready to listen, due to limitations of recognition technology. This paper proposes and evaluates a method to reduce the sense that the timing of starting the conversation is bad. In this method, two robots perform a cooperative behavior during the waiting time between the call that attracts the addressee's attention and the main utterance the robot wants to deliver. To evaluate the effectiveness of this method, we conducted an experiment that compared three conversation initiation approaches: early timing and late timing by one robot, and the proposed approach involving two robots. The results revealed that the proposed method mitigates the perceived bad timing of main utterances.


1. Introduction

Social robots have been used to provide information or advertisements at museums [Citation1, Citation2], expositions [Citation3], receptions [Citation4], shops [Citation5], stations [Citation6], and shopping malls [Citation7]. To leave a positive impression of the information, starting the conversation naturally is as important as its content: an unnatural start strikes addressees as strange. Several studies have therefore examined how a robot should start a conversation to improve its quality.

However, starting a conversation naturally is not easy for robots. Consider the case of human–human interaction. Goodwin [Citation8] has shown systematic procedures by which speakers obtain an addressee's attention. In the simple case where a speaker attempts to start a conversation with an addressee, the speaker's behavior tends to proceed as follows.

  1. The speaker offers a short greeting term like ‘Hey’ to attract the addressee's attention.

  2. The speaker waits for the addressee to be ready to listen.

  3. The speaker then offers a main topic for discussion.

To achieve these behaviors, robots need to recognize whether the addressee is ready to listen, as in the second step above. This recognition is composed of a variety of sensing mechanisms, such as face [Citation9] and gaze direction [Citation10–13] and the distance and positional relation to the addressee [Citation8, Citation14, Citation15]. However, such recognition sometimes fails in real environments.

Recognition failure of the addressee's acknowledgement prevents robots from starting a conversation naturally. Figure 1 shows two examples of unsuccessful interaction. In the first scene of (a), the robot says ‘Hey’ to attract the addressee's attention. After that, the robot waits for the addressee to be ready to listen. Here, the robot recognized the addressee as ready to listen, but in fact he was not. This is a false positive recognition event. In this case, as shown in the second scene, the robot starts talking about the main topic even though the addressee is not ready to listen. Such behavior would not only annoy the addressee but also result in a failure to communicate the main topic, as shown in the third scene.

Figure 1. Two examples of miscommunication of starting a conversation. (a) false positive recognition causes too-early main talks. (b) false negative recognition causes too-late main talks.


To avoid such an unpleasant situation for the addressee, we can tighten the recognition threshold to reduce the likelihood of a false positive. However, this change increases the risk of a false negative. As shown in Figure 1(b), even when the addressee is ready, the robot then does not speak about the main topic for an extended time.

In this study, we propose and evaluate a method to reduce the sense that the timing of starting the conversation is bad. The method is inspired by the effects of coordinated behavior among multiple robots, which has attracted much attention recently. This paper is organized as follows. The next section introduces previous work related to the method, and the third section describes the design of a behavior pattern for multiple robots. The fourth and fifth sections explain two experiments and their results, which verify the effectiveness of the method. We discuss the effectiveness of the proposed method in the sixth section. Finally, the seventh section concludes our study.

2. Related work

2.1. Starting a conversation by a robot

Several studies have examined how a robot can start a conversation so as to improve its quality. Many have revealed the importance of eye contact in starting a conversation [Citation9–12]. For example, Kuzuoka et al. [Citation14] examined interactions between visitors and staff at an elderly day care facility in detail, using ethnomethodological methods. They reported that a caregiver's gaze direction let elderly people know whether they could start a conversation. Kompatsiari et al. [Citation13] investigated how eye contact between a human and a robot affects human engagement with the robot. They found that humans are more likely to look at the robot's face when the robot is making eye contact (implicit engagement). This result suggests that measuring human eye gaze is important for determining whether humans are ready to listen to the robot. Nakano et al. [Citation10] analyzed addressees' gaze behavior and found that patterns of gaze transition correlated with subjective or observational judgments of an addressee's engagement in a conversation. They proposed an engagement estimation method based on gaze direction and showed that the method improves the addressee's impression in human–agent interaction. In addition, the distance and positional relation between a person and a robot are also useful for this judgment [Citation8, Citation14, Citation15]. Shi et al. [Citation15] showed that controlling the distance and positional relation improves the quality of a conversation.

Those studies have tried to model human behavior in starting a conversation and to apply the model to a robot. If the robot can correctly recognize the gaze direction, distance, and positional relation of an addressee, the proposed methods work well. However, when recognition fails, they do not. How a robot should cope with such cases is unknown, because the previous studies did not address it.

In this paper, we propose a behavior design for starting a conversation using multiple robots as a precaution for cases where the robot cannot recognize the cues for starting a conversation.

2.2. Effectiveness of using multi-robots

Studies of human–robot interaction have reported that the use of multiple robots changes the impressions and behaviors of the interacting partner [Citation16–19]. The following studies suggest potential merits of the social context generated by multiple robots in conversation.

Sakamoto et al. [Citation16] developed robots as a passive-social medium, in which multiple robots converse with each other. They conducted a field experiment at a station to investigate the effects of the robots. They found that people were more likely to stop to listen to a conversation between two robots than to a single robot. The results suggest that the social context generated by two robots attracted users' attention more than speech by one robot.

Iio et al. [Citation18] developed a turn-taking pattern in which two robots behave according to a pre-scheduled scenario, to keep the robots' verbal responses from sounding incoherent in the context of the conversation. Their experiments showed that participants who talked to two robots using the turn-taking pattern felt the robots' responses to be more coherent than did those who talked to a single robot without it. This suggests that the social context generated by a bystander robot had a positive effect on users' impressions of a conversation.

Those findings suggest that the social context generated by two robots has the potential to improve the quality of a conversation. We consider that the tendency shown by Iio et al. [Citation18], in which cooperative behavior by a bystander robot reduces the incoherence perceived by an addressee, will be useful for designing the start of a conversation.

3. Behavior pattern for multi-robot to start a conversation

In the case of a single robot, if the robot cannot deliver the main topic with near-perfect timing, people find the behavior unnatural. Our idea for avoiding this problem is simple: two robots keep an addressee waiting a little longer but alleviate the sense of bad timing of the main topic. The idea is based on the following hypothesis: if a robot speaks about a main topic before an addressee is ready (i.e. Figure 1(a)), it inhibits communication, because this not only annoys the addressee but also results in a failure to communicate the main topic. We consider it better to keep the addressee waiting a bit longer than to force a miscommunication. This idea stems from the basic principles of conversation proposed by Sacks et al. [Citation20]: conversation is a coordinated activity, with participants constantly navigating and negotiating their turns so as to maintain the flow and coherence of the interaction. This holds true even in the dynamics of conversations among three or more people. Based on this idea, we designed an interaction between two robots as shown in Figure 2.

Figure 2. The proposed behavior pattern between two robots before starting a conversation.


  1. One of the robots, called the speaker robot, faces the addressee and speaks a short greeting term like ‘Hey’ to attract their attention. At that time, the other robot, called the bystander robot, turns toward the addressee within several hundred milliseconds (e.g. 0.5 s).

  2. After a few seconds (e.g. 2.0 s), the bystander robot looks back at the speaker robot and says something to fill the empty time between the short greeting term and the main topic. In Figure 2, the bystander robot says, ‘What's going on?’ The speaker robot also turns back to the bystander robot when the bystander robot starts to speak.

  3. The speaker robot responds to the utterance of the bystander robot.

  4. Finally, after the response, the speaker robot turns to the addressee again and speaks on the main topic. The bystander robot also looks at them during this statement.
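The four-step pattern above can be sketched as a timed event sequence. This is a minimal illustration only: the `Robot` class, the delay values, and the utterances are assumptions for exposition, not the authors' actual control software.

```python
# Minimal sketch of the proposed behavior pattern (Figure 2).
# The Robot class, delays, and utterances are illustrative
# assumptions, not the authors' actual implementation.
class Robot:
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def face(self, target):
        self.log.append((self.name, "face", target))

    def say(self, utterance):
        self.log.append((self.name, "say", utterance))


def start_conversation(speaker, bystander, main_utterance, sleep):
    # Step 1: the speaker calls the addressee; the bystander turns
    # toward the addressee a few hundred milliseconds later.
    speaker.face("addressee")
    speaker.say("Hey")
    sleep(0.5)
    bystander.face("addressee")

    # Step 2: after ~2 s the bystander looks back at the speaker to
    # fill the gap; the speaker turns to the bystander as well.
    sleep(2.0)
    bystander.face(speaker.name)
    speaker.face(bystander.name)
    bystander.say("What's going on?")

    # Step 3: the speaker responds to the bystander's interjection.
    speaker.say("Hold on, I was about to say something.")

    # Step 4: both robots face the addressee again and the speaker
    # delivers the main utterance.
    speaker.face("addressee")
    bystander.face("addressee")
    speaker.say(main_utterance)


log = []
start_conversation(Robot("Speaker", log), Robot("Bystander", log),
                   "Good luck!", sleep=lambda s: None)  # no real waiting here
print(len(log))  # 10 logged events
```

Keeping the gaze ("face") and speech events in one ordered log makes it easy to check that each robot's attention target matches who is speaking at every step.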

The point of the behavior pattern is that the two robots try to reduce the addressee's aggravation by demonstrating a short conversation. The short conversation creates enough time for the addressee to become ready to listen. If the addressee gives their attention early, they observe the conversation; we suppose this situation is more tolerable for them than one in which the robot says nothing for a few seconds. If the addressee's attention comes later, they still hear the main topic at an acceptable time.

Note that it may be possible to fill the interval between the call and the main utterance with a single-robot monologue. For example, after the short greeting term, a robot could talk to itself: ‘What was I trying to say? I forgot. Uh, oh! I remembered!’ However, we consider that a two-robot dialogue is more likely than a single-robot monologue to provide room for interpretation regarding the unnatural timing of the main utterance.

The monologue ‘I forgot what I was supposed to say’ shown above provides room for interpretation regarding the lateness of the robot's main utterance, but it would be difficult to provide similar room with other monologues. For example, monologues such as ‘What should I eat for lunch?’ or ‘Later, I have to do that job’ make the user wonder, ‘Why is this robot talking to itself when I am already ready to listen?’ In other words, these types of monologues do not excuse the lateness of the main utterance and leave the user irritated. On the other hand, a dialogue between two robots can provide room for interpretation regarding the lateness of the main utterance regardless of its content, because it is obvious that the robot calling the user cannot deliver its main utterance until the dialogue started by the other robot's interruption is completed. For example, even if the other robot interrupts with ‘What should we eat for lunch?’, the user can interpret the situation as ‘This robot cannot talk to me until it finishes its dialogue with the other robot.’

Thus, while it is difficult to create variations on monologues, it is relatively easy to create a variety of dialogue content. For these reasons, it is reasonable to begin our research with dialogue between two robots.

4. Experiment 1

Experiment 1 aims to evaluate the effectiveness of the proposed method in a situation where a robot starts talking when the addressee is concentrating on something. The details of the experiment are described below. This experiment was approved by the ethics committee of Osaka University.

4.1. Task

In order to manipulate participants' focus, a task was administered in which the participant was instructed to respond to visual cues displayed on the monitor (see Figure 3, Stimulus). The participants' actions proceeded as follows: initially, participants fixated their gaze on the cross symbol presented on the monitor. Upon the appearance of a left arrow (the stimulus), they directed their attention toward the robot positioned on the left side. During this phase, the robot vocalized the prescribed call and delivered its main utterance. Following the main utterance, participants assessed the goodness of its timing through the graphical user interface (GUI) displayed on the monitor. The temporal gap between the call and the main utterance, and the timing of presenting stimuli, were manipulated according to the design of the experiment.

Figure 3. Experiment 1 environment.


The call was ‘Hey’ each time, and for the main utterance, several utterances to cheer up the participants (such as ‘good luck’) were prepared and uttered at random.

4.2. Conditions

Our study employed a two-factor within-participant experimental design to assess the efficacy of the proposed method, considering both the method itself and the timing of stimulus presentation as factors. The respective levels of each factor were as follows.

4.2.1. Method factors

The method factor encompassed three distinct levels: Two-7.5, One-7.5, and One-3.0. The Two-7.5 level represents the application of the proposed method, wherein a robot initiates a call and engages in a brief conversation with another robot before proceeding to deliver the main utterance. After the robot's call, another robot randomly uttered either the sentence ‘Let me say it’ or ‘You can say it’. When ‘Let me say it’ was selected, the robot that interrupted spoke the main utterance, and when ‘You can say it,’ was selected, the robot that made the call spoke the main utterance. The time interval between the call and the main utterance in this level was set at 7.5 s. In the One-7.5 level, a control condition was established where a single robot initiated the call, waited for 7.5 s, and then delivered the main utterance. As for the One-3.0 level, another control condition was implemented, featuring a waiting time of 3.0 s.
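As a rough sketch, the three method levels and the Two-7.5 interjection logic can be encoded as follows. The field and function names here are hypothetical illustrations, not taken from the authors' system.

```python
import random

# Hypothetical encoding of the three method-factor levels; field names
# are illustrative assumptions, not the authors' implementation.
METHOD_LEVELS = {
    "Two-7.5": {"robots": 2, "call_to_main_s": 7.5},
    "One-7.5": {"robots": 1, "call_to_main_s": 7.5},
    "One-3.0": {"robots": 1, "call_to_main_s": 3.0},
}


def choose_main_speaker(level, rng=random):
    """Return which robot delivers the main utterance at a given level."""
    if METHOD_LEVELS[level]["robots"] == 1:
        return "caller"  # a single robot both calls and speaks
    # In the Two-7.5 level the second robot randomly utters one of two
    # lines; the chosen line decides who delivers the main utterance.
    line = rng.choice(["Let me say it", "You can say it"])
    return "interrupter" if line == "Let me say it" else "caller"


print(choose_main_speaker("One-3.0"))   # always "caller"
print(choose_main_speaker("Two-7.5"))   # "caller" or "interrupter"
```

Randomizing which robot delivers the main utterance keeps participants from learning to attend to only one of the two robots in the Two-7.5 condition.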

The inclusion of the One-3.0 level stemmed from observations made during the preliminary experiment. The preliminary experiment was conducted to identify the appropriate time between the call and the main utterance when people could respond to the robot's call at any time. The testing procedure was as follows: the robot called out to the participant, and then after a random time ranging from 1 to 7 s from the start of the call, the main utterance was made. Participants rated whether the timing of its main utterance was good. Participants were members of the laboratory. The results of the preliminary experiment indicated that participants' response time to the robot's call predominantly fell within the range of 1.5 to 2.0 s. Additionally, based on user preferences, a time gap of approximately 1.0 second between the participant's visual perception of the robot and the initiation of the robot's speech was considered desirable. Therefore, to cater to individuals capable of responding relatively swiftly, a waiting time of around 3.0 s between the call and the main utterance was deemed suitable and included as part of the experimental design.

4.2.2. Stimulus timing factors

The stimulus timing factor comprised four levels at 1.5-second intervals, ranging from 1.5 s to 6.0 s. These levels were implemented to investigate the efficacy of the proposed method under different delays before participants directed their attention toward the robot. The timeline of the stimulus timing, from the initial call to the main utterance, is illustrated in Figure 4.

Figure 4. Experiment 1 design.


In the One-3.0 level, a stimulus timing of 1.5 s is predicted to be rated positively, while the other timings receive lower ratings. Similarly, in the One-7.5 level, a stimulus timing of 6.0 s is expected to be perceived favorably, while the other timings are rated lower. If the proposed method is indeed effective, all stimulus timing levels should receive moderately good evaluations.

4.3. Participants

Participants for this study were recruited by advertising part-time experimental positions at Osaka University on a bulletin board. To achieve an equitable gender distribution, individuals were selected from the pool of applicants, leading to a total of 12 participants (6 females and 6 males). All participants were university students at Osaka University.

4.4. Environment and apparatus

A table-top humanoid robot, CommU, manufactured by Vstone Co., Ltd., was used in this study; CommU is approximately 30 cm tall with a child-like appearance. It has a total of 14 degrees of freedom to express basic body movements in dialogue, such as eye gaze and beat gestures. In particular, the eye gaze enables eye contact with a dialogue partner. The robot can also open and close its mouth while playing back a voice file generated by speech synthesis software; a male child's voice was employed. In this experiment, the robots had to present clearly to the user to whom each robot was talking and to whom it was directing its attention. From this perspective, it was appropriate to use CommU, which can represent the targets of speech and attention nonverbally by controlling its eyes and mouth.

4.5. Procedure

Upon arrival at the laboratory, participants received an initial briefing from the experimenter regarding the experiment's purpose and procedures. Subsequently, those who consented to participate signed the required consent form. For this study, all 12 participants willingly agreed to take part in the experiment. Following the completion of the consent process, participants assumed a seated position in the chair situated at the center and awaited the commencement of the task.

Given that the experiment adopted a within-participant design, participants underwent the task at all levels of each factor. Specifically, there are two factors in this experiment: a method factor (One-3.0 level, One-7.5 level, and Two-7.5 level) and a stimulus timing factor (1.5 level, 3.0 level, 4.5 level, and 6.0 level). After randomly selecting one level from the method factor, each level of the stimulus timing factor was presented randomly four times, with that level of the method factor fixed. The rationale for conducting four repetitions for each level was twofold: firstly, to obtain the average value as the measurement, and secondly, to mitigate potential errors. Consequently, participants completed the task a total of 48 times (3 levels of method factor × 4 levels of stimulus timing factor × 4 repetitions). To ensure balanced treatment, the implementation order of each level within the method factor was counterbalanced. The inter-task interval, i.e. the duration between the completion of the main utterance evaluation and the robot's subsequent call, was set at 30 s.
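The 48-trial schedule can be sketched as follows. This is one plausible implementation of the design described above (method order counterbalanced across participants, stimulus timings shuffled within each method block); the authors' actual randomization code is not published, so details like the counterbalancing scheme are assumptions.

```python
import itertools
import random

METHODS = ["One-3.0", "One-7.5", "Two-7.5"]
TIMINGS = [1.5, 3.0, 4.5, 6.0]
REPS = 4

# All 6 orders of the three method levels, used for counterbalancing.
METHOD_ORDERS = list(itertools.permutations(METHODS))


def build_schedule(participant_id, rng=random):
    """Return the 48 (method, timing) trials for one participant."""
    order = METHOD_ORDERS[participant_id % len(METHOD_ORDERS)]
    schedule = []
    for method in order:
        block = TIMINGS * REPS          # 16 trials per method block
        rng.shuffle(block)              # timing order randomized
        schedule.extend((method, t) for t in block)
    return schedule


schedule = build_schedule(participant_id=0)
print(len(schedule))  # 48 = 3 methods x 4 timings x 4 repetitions
```

Keeping each method level as a contiguous block matches the procedure in which the second robot is added or removed between blocks.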

Regarding the presence of the robots: at the One-3.0 and One-7.5 levels, only one robot was placed on the desk; the other robot was hidden under the desk. In the Two-7.5 level following the One-3.0 level (or the One-7.5 level), the experimenter first led participants away from the experimental site, took the other robot from under the desk and placed it on the desk, and then brought the participants back. Similarly, in the One-3.0 level (or the One-7.5 level) following the Two-7.5 level, the experimenter led participants away, moved the other robot back under the desk, and then brought them back. In other words, participants never saw two robots at the One-3.0 level (or the One-7.5 level), so the mere presence of a second robot did not affect their perceptions at those levels.

4.6. Measurements and analysis

Participants used a 5-point scale to rate the goodness of the timing of the main utterance. The scale ranged from 1 (most negative rating) to 3 (neutral rating) and up to 5 (most positive rating).

The purpose of Experiment 1 was to verify whether the proposed method would rate the timing of the robot's main utterance as good, even if the timing of participants' attention to the robot varied. Therefore, we gave participants the following explanation before the experiment: the robot utters a call like ‘Hey’ followed by a cheering message like ‘Good luck.’ You evaluate the timing of the utterance of that cheering message. In the evaluation, five options of ‘bad,’ ‘rather bad,’ ‘undecided,’ ‘rather good,’ and ‘good’ will be presented. Please choose the option that is closest to your impression. In the analysis, ‘bad’ was assigned to 1, ‘rather bad’ to 2, ‘undecided’ to 3, ‘rather good’ to 4, and ‘good’ to 5, respectively.

For each combination of the method factor level and the stimulus timing factor level, participants provided four rating scores, from which an average rating was calculated for each combination.

To assess the effectiveness of the proposed method, a within-participant two-factor analysis of variance (ANOVA) was performed on the method factor and stimulus timing factor. The chosen significance level for statistical significance was set at 0.05. In the post-hoc test, the Bonferroni method was applied to correct p-values for multiple comparisons.
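The aggregation and ANOVA described above can be sketched with `statsmodels`. The ratings below are random placeholders standing in for the real data, so the F values are meaningless; only the pipeline and the degrees of freedom (which follow from the 12 × 3 × 4 design) are illustrative.

```python
import itertools

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated 5-point ratings standing in for the real data: 12
# participants x 3 method levels x 4 stimulus timings x 4 repetitions.
rng = np.random.default_rng(0)
rows = [
    {"participant": pid, "method": m, "timing": t,
     "rating": int(rng.integers(1, 6))}
    for pid, m, t, _ in itertools.product(
        range(12), ["One-3.0", "One-7.5", "Two-7.5"],
        [1.5, 3.0, 4.5, 6.0], range(4))
]
df = pd.DataFrame(rows)

# Average the four repetitions to get one score per cell.
cell_means = (df.groupby(["participant", "method", "timing"],
                         as_index=False)["rating"].mean())

# Two-factor within-participant ANOVA; AnovaRM needs exactly one
# observation per participant x cell, hence the aggregation above.
anova = AnovaRM(cell_means, depvar="rating", subject="participant",
                within=["method", "timing"]).fit()
print(anova.anova_table)  # F, df, and p for each effect
```

Bonferroni-corrected post-hoc comparisons could then be run as paired t-tests on the per-participant means of each factor level, multiplying each p-value by the number of comparisons.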

4.7. Results

The mean ratings of the goodness of the timing of main utterances at the One-3.0 level were 4.35 (SD = 0.559) at the 1.5-second level, 1.96 (SD = 0.582) at the 3.0-second level, 1.56 (SD = 0.675) at the 4.5-second level, and 1.60 (SD = 0.822) at the 6.0-second level. The means at the One-7.5 level were 1.85 (SD = 0.670) at the 1.5-second level, 2.44 (SD = 0.770) at the 3.0-second level, 3.08 (SD = 0.973) at the 4.5-second level, and 4.46 (SD = 0.498) at the 6.0-second level. The means at the Two-7.5 level were 3.29 (SD = 0.976) at the 1.5-second level, 3.50 (SD = 0.859) at the 3.0-second level, 3.58 (SD = 0.606) at the 4.5-second level, and 3.81 (SD = 0.880) at the 6.0-second level.

Figure 5 shows the average scores at each level.

Figure 5. Result of the goodness of the timing of main utterance. Error bars mean 95% CI.


The two-factor ANOVA showed a main effect of the method factor (F(2, 22) = 26.13, p<.001, ηp² = 0.704), a main effect of the stimulus timing factor (F(3, 33) = 8.92, p<.001, ηp² = 0.448), and an interaction between the two factors (F(6, 66) = 31.89, p<.001, ηp² = 0.774). Only the post-hoc test results relevant to the evaluation of the effectiveness of the proposed method are presented below.

The results of the post-hoc test for the method factor showed significant differences between the One-3.0 and One-7.5 levels (t(11) = 3.93, p = .007, Hedges' g = −1.314), between the One-3.0 and Two-7.5 levels (t(11) = 7.02, p<.001, Hedges' g = −2.993), and between the One-7.5 and Two-7.5 levels (t(11) = 3.46, p = .016, Hedges' g = −1.144). In other words, the ratings of the timing of main utterances at the Two-7.5 level were significantly higher than at the other levels.

The results of the post-hoc test for the interaction are shown in Table 1. The simple main effects are summarized below:

  • 1.5 level: the rating at the One-3.0 level was significantly higher than at the One-7.5 level.

  • 3.0 level: the rating at the Two-7.5 level was significantly higher than at the One-3.0 level.

  • 4.5 level: the ratings at the One-7.5 and Two-7.5 levels were significantly higher than at the One-3.0 level.

  • 6.0 level: the ratings at the One-7.5 and Two-7.5 levels were significantly higher than at the One-3.0 level.

Table 1. Results of the post-hoc tests of the interaction between the method factor and the stimulus timing factor.

4.8. Discussion

The evaluation of the main-utterance timing at the Two-7.5 level significantly outperformed the other levels across the stimulus presentation timings as a whole. This indicates that the proposed method achieved better perceived timing of the main utterance than the single-robot methods. The rationale behind this outcome can be explored by examining the rating trends at the One-3.0 and One-7.5 levels.

At both the One-3.0 and One-7.5 levels, the main utterance received high ratings when delivered immediately after the participant turned to face the robot, but considerably lower ratings at other timings. Specifically, at the One-3.0 level, when the robot delivered the main utterance promptly after the participant's turn to the robot at the 1.5 level, the timing received a notably high rating. However, it was rated significantly lower at the 3.0, 4.5, and 6.0 levels, because the main utterance was delivered at the same time as, or before, the moment participants directed their attention toward the robot.

Similarly, at the One-7.5 level, when the robot initiated the main utterance immediately after the participant turned to the robot at the 6.0 level, the timing was rated quite high. Conversely, at the 1.5 and 3.0 levels, the robot waited for 6.0 and 4.5 s, respectively, after the participant's turn before delivering the main utterance. These unnaturally long waiting times seemed to give participants negative impressions. At the 4.5 level, where the wait time was relatively short (3.0 s), the timing of the main utterance was perceived as neither good nor bad.

In contrast to the One-3.0 and One-7.5 levels, the Two-7.5 level consistently achieved good timing of the main utterance, scoring at least 3 points (undecided) across all stimulus presentation timing levels. Although it showed a slight, statistically non-significant decrease compared to the perfectly matched stimulus presentation timings (i.e. One-3.0 with the 1.5 level, One-7.5 with the 6.0 level), it maintained a rating of ‘reasonably good’ at all other stimulus presentation timing levels. Consequently, overall, the Two-7.5 level demonstrated superior timing of main utterances compared to the other levels.

The above findings underscore that the proposed method makes the timing of main utterances appear better by bridging pauses with robot-to-robot conversation, even when individuals are ready to listen to the robot sooner than anticipated.

5. Experiment 2

In Experiment 1, the situation differed significantly from real-world scenarios because the timing of participants' attention to the robot was regulated by the experimenter. In practical settings, individuals direct their attention based on their current context. Hence, the efficacy of the proposed method in situations where humans control the timing of their attention to the robot remained uncertain. To address this issue, we conducted Experiment 2, employing a task in which the timing of attention is governed by the participants themselves, with the aim of demonstrating the effectiveness of the proposed method in such contexts. This experiment was approved by the ethics committee of Osaka University.

5.1. Hypothesis

The timing of the addressee's attention toward the robot that engages them in conversation is contingent upon the addressee's cognitive load at that moment. A low cognitive load implies that the addressee is not preoccupied and is likely to respond promptly to the robot. Conversely, a high cognitive load suggests the addressee is engrossed in a task, resulting in a delayed response to the robot. Therefore, the following hypotheses were formulated:

  • Under conditions of low cognitive load, individuals prefer that robots initiate the main utterance promptly, without any delay.

  • Under conditions of high cognitive load, individuals prefer to start the main utterance after an appropriate delay.

Moreover, the proposed behavior pattern is expected to be favored by individuals experiencing both low and high cognitive loads. Specifically:

  • Individuals with a low cognitive load exhibit a preference for the proposed behavior pattern over one with a delay and express an equal preference between the proposed behavior and one without any delay.

  • Individuals with a high cognitive load favor the proposed behavior pattern over one without any delay but show an equal preference for the proposed behavior pattern and one with a delay.

We conducted an experiment to verify these hypotheses. We developed a game called ‘Detective game online’ for the experiment. While the participant is playing the game, a robot tries to start a conversation. The details of the game are explained below.

5.2. Detective game online

To verify our hypothesis, we would like to control the cognitive load of participants at the time a robot attempts to start the main utterance. For this purpose, we developed a game called ‘Detective game online’.

5.2.1. How to play the detective game

The game is conducted using Skype with audio-only communication; no video images are involved. The game comprises one participant designated as the question master and two confederates assigned as detectives. The detectives are given a paper containing a riddle, while the question master receives a paper containing both the riddle and its answer. The detective's objective is to deduce the answer by posing a series of questions to the question master. The question master's role is to guide the detective towards the correct answer by responding with only ‘Yes, that's right’ or ‘No, that's wrong.’ Due to this constraint, the detective must craft thoughtful questions to swiftly arrive at the solution. An example of a riddle and its answer is provided below:

Riddle: Two men who are close friends met for the first time in ten years, yet they refrained from exchanging any words. They had no visual or hearing impairments, and the location where they met did not prohibit conversation. Why did they remain silent towards each other? Answer: They were divers and encountered each other underwater.

The subsequent example showcases a question posed by a detective and the corresponding response from the master:

Detective: Did they have an argument at that time? Master: No, that's wrong.

As shown in the example, participants had to listen to a question from the detective actor, understand it, and respond to lead the detective actor to the answer. This was a much more complex task than in Experiment 1 and required participants to concentrate on the game.

5.2.2. Controlling confederate behaviors

To ensure that the behaviors of the confederates assuming the detective role do not influence the participant's perceptions, we devised a script for them. Additionally, we recorded the confederates' voices as they spoke the script. During the experiment, an experimenter, positioned out of the participant's view, played the voice recordings. Consequently, the confederates did not engage in direct conversations with the participant. This approach afforded us precise control over the confederates' behaviors, encompassing content, speed, volume, and speech patterns. To prevent participants from recognizing the use of recorded voices, we introduced background noise to the voice recordings. Furthermore, to account for instances where a participant might request the addressee to repeat a question (e.g. ‘I didn't catch you. Could you repeat it again?’), we recorded two versions of each question.

Moreover, the decision to employ two confederates as detectives, rather than just one, was deliberate and based on specific reasons. When participants undertake the detective role without a script, they may encounter challenges in formulating a coherent series of questions within a limited timeframe. Furthermore, participants might discern that a detective confederate is merely reciting from a script. To circumvent this potential issue, we enlisted two confederates as detectives, with one confederate visibly pondering a question while the other poses the question to the participant. This approach not only enhances the authenticity of the detective role but also reduces the likelihood of participants suspecting the involvement of confederates in the game.

5.3. The appropriateness of adopting a detective game

There are several advantages to using the detective game for our experiment:

  • The detective game reduces side effects caused by the confederates. Their nonverbal behavior, such as gaze, facial expressions, body posture, and movement, would otherwise affect the participant's thought process. To investigate the effects of the robot's behaviors, we should control these factors, but doing so is difficult in face-to-face communication. By using Skype without video images, we can reduce these unexpected side effects and focus on the robot's behavior.

  • In addition, an operator playing pre-recorded voices lets us control aspects of the confederates' behavior that would otherwise be difficult to control, such as volume, speed, and speech patterns. Because the game has a clearly defined communication protocol, we expect participants not to notice that the detectives are not actually conversing live.

  • Finally, we can control the participant's cognitive load by manipulating the relative timing of the robot's and the detective's behavior. If a robot speaks on the main topic while a detective is asking a question, the participant's cognitive load will be quite high, because the participant hears both voices at the same time. In contrast, if the robot speaks on the main topic before a detective asks, the participant's cognitive load will be low. We describe the details of this timing manipulation in the Conditions section.

5.4. Conditions

To test our hypothesis, we established two factors. The experiment was designed as a within-participant study, ensuring that participants experienced all conditions.

5.4.1. Method factor

Similar to Experiment 1, we incorporated three levels for the method factor in this study: One-3.0, One-7.5, and Two-7.5. In the One-3.0 and One-7.5 levels, a single robot initiated a call and waited for 3.0 or 7.5 s, respectively, before delivering the main utterance. In contrast, in the Two-7.5 level, two robots engaged in a brief conversation during the 7.5 s waiting time: after the first robot's call, the other robot said, ‘Hmm? What's up?’. The same robot always made the call and the main utterance, and the same robot always interjected.
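The three method-factor timelines can be sketched as data. This is a minimal illustration, not the authors' control code: the 3.0 s / 7.5 s waits and the interjection come from the description above, but the interjection's exact onset within the waiting time is an assumption.

```python
# Sketch of the three method-factor timelines, with times in seconds from
# the robot's call. The 3.0 s / 7.5 s waits and the interjection wording
# come from the text; the interjection onset within the wait is assumed.
CONDITIONS = {
    "One-3.0": [(0.0, "robot A", "call"),
                (3.0, "robot A", "main utterance")],
    "One-7.5": [(0.0, "robot A", "call"),
                (7.5, "robot A", "main utterance")],
    "Two-7.5": [(0.0, "robot A", "call"),
                (1.5, "robot B", "interjection"),   # "Hmm? What's up?"
                (7.5, "robot A", "main utterance")],
}

def waiting_time(condition):
    """Delay between the call and the main utterance for a condition."""
    events = {what: t for t, _, what in CONDITIONS[condition]}
    return events["main utterance"] - events["call"]
```

Note that One-7.5 and Two-7.5 share the same call-to-main-utterance delay; only the presence of the second robot's interjection differs, which is the manipulation of interest.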

5.4.2. Detective behavior factor

This factor pertains to the behavior exhibited by the detectives during the interaction. When a robot communicates with a participant engaged in the detective game on Skype, the participant's status is classified into four categories of detective behavior: S1, S2, S3, and S4. Figure 6 illustrates the temporal overlap between the detective's utterance and the robot's utterance for each level. In Figure 6, the detective's utterance is represented by the dark gray rectangle. The meaning of each level is elucidated below:

  • S1: The detective remains silent while the robot speaks. This scenario facilitates easy listening to the robot, indicating that the participant experiences minimal cognitive load.

  • S2: The detective starts speaking after the robot delivers a brief greeting. In the One-3.0 condition, the detective's utterance overlaps with the robot's main topic. However, in the One-7.5 condition, the overlap is absent, and in the Two-7.5 condition, the detective's utterance overlaps with a conversation between the robots but not with the main topic.

  • S3: The detective commences speaking just before the robot delivers a short greeting. In the One-3.0 condition, the detective's utterance overlaps both the robot's short greeting and the main topic. However, in the One-7.5 condition, the overlap is restricted to the robot's short greeting but not the main topic. In the Two-7.5 condition, the overlap encompasses the robot's short greeting and the conversation between the robots but not the main topic.

  • S4: The detective initiates speaking 3.5 s before the robot's short greeting. In the One-3.0 and One-7.5 conditions, the detective's utterance overlaps with the robot's short greeting but not the main topic. In the Two-7.5 condition, the overlap involves the robot's short greeting and slightly extends into the conversation between the robots.

Figure 6. Experiment 2 design.


5.4.3. Predictions

We anticipated specific effects on the participant's cognitive load resulting from the combination of factors. In scenarios S2 and S3 during the One-3.0 condition, the detective's utterance overlaps with the robot's main topic. As a consequence, participants are likely to experience a heightened cognitive load compared to other conditions. Conversely, in situations S1 and S4 during the One-7.5 condition, prolonged intervals occur when no one speaks. These conditions impose minimal cognitive load on the participant.

More specifically, we predicted that, in S2 and S3 under One-3.0, participants would feel the timing of the robot's main topic to be too early. In S1 and S4 under One-7.5, participants would feel the timing to be too late because of the waiting period. In addition, for these conditions, participants would not feel the timing to be natural. Under Two-7.5, we believed participants would not feel as negative as in the prior conditions; in other words, they would provide a relatively good evaluation in most cases.

5.5. Participants

The number of participants was eighteen (nine males and nine females). The participants were university students.

5.6. Environment and apparatus

Figure 7 depicts the experimental setup and equipment utilized in the study. The robots were positioned on a table adjacent to a wall, with a poster on the wall behind them. Additionally, a laptop computer running Skype was situated on another table approximately 1.5 m away from the robots. The participant occupied a chair placed approximately 120 cm from the robot(s), facing at an angle of 135° to the right of the robots. Consequently, the robots were positioned diagonally, backward and to the left of the participant, and were not visible when the participant faced forward. To engage in the detective game on Skype, the participant employed the laptop on the table in front of them. During the game, the participant spoke into the desktop microphone on the table and heard the detectives via headphones. Similar to Experiment 1, we utilized the CommU robot in this study.

Figure 7. Experiment 2 environment.


To ensure consistent interactions, the robots followed scripted dialogues and gestures prepared for three distinct conditions. The robot's behavior and the details of these conditions are elaborated upon in the subsequent section.

5.7. Procedure

The experiment followed much the same procedure as Experiment 1. Participants were first given an explanation of the purpose and procedure of the experiment and then asked to fill out a consent form. Eighteen participants agreed to participate in the experiment.

The method of presenting the levels of each factor was the same as in Experiment 1. One level (One-3.0, One-7.5, or Two-7.5) was selected from the method factor, and the levels of the detective behavior factor (S1, S2, S3, and S4) were presented in random order under that method level. Participants rated the main utterance each time a stimulus (i.e. the robot's main utterance) was given. This procedure was repeated for each level of the method factor.

5.8. Measurements and analysis

After each robot's question, participants completed a questionnaire consisting of the following three items.

  • (a) Did you feel the timing of the question wording too early?

  • (b) Did you feel the timing of the question wording too late?

  • (c) Did you feel the timing of the question wording to be natural?

Participants provided ratings for these items using a 5-point scale, where scores of one, three, and five corresponded to disagreement, neutrality, and agreement, respectively.

To assess the effectiveness of the proposed method, a within-participant two-factor analysis of variance (ANOVA) was performed on the method factor and the detective behavior factor. The significance level was set at 0.05. In the post-hoc tests, the Bonferroni method was applied to correct p-values for multiple comparisons.
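As a minimal sketch of the correction step (not the authors' analysis code), the Bonferroni method simply multiplies each raw p-value by the number of comparisons and caps the result at 1.0; the p-values below are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each raw p-value by the number of
    comparisons and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for three pairwise comparisons, alpha = .05:
raw = [0.0003, 0.0100, 0.0320]
adj = bonferroni(raw)                  # approx. [0.0009, 0.03, 0.096]
significant = [p < 0.05 for p in adj]  # the third comparison is no longer
                                       # significant after correction
```

This illustrates why a pairwise comparison with an uncorrected p just under .05 can fail to reach significance once the correction for three comparisons is applied.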

5.9. Results

5.9.1. (a) The sense that the timing was too early

The results of the evaluation of whether the timing of the robot's question was too early are shown in Figure 8. The mean ratings at the One-3.0 level were 2.61 (SD=1.420) at the S1 level, 4.11 (SD=1.183) at the S2 level, 2.94 (SD=1.626) at the S3 level, and 2.83 (SD=1.543) at the S4 level. The mean ratings at the One-7.5 level were 1.39 (SD=0.850) at the S1 level, 1.39 (SD=0.698) at the S2 level, 1.33 (SD=0.686) at the S3 level, and 1.33 (SD=0.594) at the S4 level. The mean ratings at the Two-7.5 level were 1.78 (SD=1.003) at the S1 level, 1.94 (SD=1.211) at the S2 level, 1.83 (SD=0.985) at the S3 level, and 1.39 (SD=0.698) at the S4 level.

Figure 8. Results of the rapid timing of main utterance. Error bars mean 95% CI.


A two-factor analysis of variance revealed a main effect of the method factor (F(2,34)=32.69, p<.001, ηp2=0.658), a main effect of the detective behavior factor (F(3,51)=7.12, p<.001, ηp2=0.295), and an interaction effect between the two factors.

To test the effect of the proposed method on the overall perception of timing, we conducted a post-hoc test of the main effect of the method factor. The results showed significant differences between the One-3.0 and One-7.5 levels (t(17)=6.93, p<.001, Hedge's g = 2.228) and between the One-3.0 and Two-7.5 levels (t(17)=5.33, p<.001, Hedge's g = 1.578); the difference between the One-7.5 and Two-7.5 levels did not reach significance (t(17)=2.33, p=.097, Hedge's g = −0.593).

The results of the simple main effect tests of the interaction are presented in Table 2. The pairs of levels for which a simple main effect was found are summarized below:

Table 2. Results of the post-hoc tests of the interaction between the method factor and the detective behavior factor.

5.9.2. (b) The sense that the timing was too late

The results of the evaluation of whether the timing of the robot's question was too late are shown in Figure 9. The mean ratings at the One-3.0 level were 1.89 (SD=1.079) at the S1 level, 1.50 (SD=0.786) at the S2 level, 1.67 (SD=1.029) at the S3 level, and 1.56 (SD=0.922) at the S4 level. The mean ratings at the One-7.5 level were 3.28 (SD=1.447) at the S1 level, 3.06 (SD=1.349) at the S2 level, 3.00 (SD=1.572) at the S3 level, and 3.44 (SD=1.423) at the S4 level. The mean ratings at the Two-7.5 level were 2.00 (SD=1.138) at the S1 level, 2.00 (SD=1.188) at the S2 level, 1.61 (SD=0.850) at the S3 level, and 1.89 (SD=1.183) at the S4 level.

Figure 9. Result of the late timing of main utterance. Error bars mean 95% CI.


We found a main effect of the method factor (F(2,34)=23.40, p<.001, ηp2=0.579). There was no main effect of the detective behavior factor (F(3,54)=0.972, p=.413, ηp2=0.054) and no interaction between the factors (F(6,102)=0.524, p=.789, ηp2=0.030).

Multiple comparisons showed that the scores at the One-7.5 level were significantly higher than those at the One-3.0 level (t(17)=6.74, p<.001, Hedge's g = −1.079) and the Two-7.5 level (t(17)=4.27, p=.002, Hedge's g = 1.202).

5.9.3. (c) The sense that the timing was natural

The results of the evaluation of whether the timing of the robot's question was natural are shown in Figure 10. The mean ratings at the One-3.0 level were 3.39 (SD=1.290) at the S1 level, 2.78 (SD=1.555) at the S2 level, 3.44 (SD=1.042) at the S3 level, and 3.33 (SD=1.188) at the S4 level. The mean ratings at the One-7.5 level were 2.44 (SD=1.294) at the S1 level, 3.11 (SD=1.323) at the S2 level, 2.94 (SD=1.434) at the S3 level, and 2.56 (SD=1.338) at the S4 level. The mean ratings at the Two-7.5 level were 3.72 (SD=0.895) at the S1 level, 3.50 (SD=0.985) at the S2 level, 3.89 (SD=0.900) at the S3 level, and 3.72 (SD=1.227) at the S4 level.

Figure 10. Results of the natural timing of main utterance. Error bars mean 95% CI.


We found a main effect of the method factor (F(2,34)=6.08, p=.006, ηp2=0.264). There was no main effect of the detective behavior factor (F(3,54)=0.972, p=.413, ηp2=0.054) and no interaction between the factors (F(6,102)=0.524, p=.789, ηp2=0.030).

Multiple comparisons showed that the scores at the Two-7.5 level were significantly higher than those at the One-7.5 level (t(17)=3.60, p=.007, Hedge's g = −1.122).

5.10. Discussion

The results for (a) showed that participants felt the timing of the robot's main topic was too early in One-3.0 more than in One-7.5 and Two-7.5. This tendency was observed in S2, as we predicted. The results suggest that when a participant was under a high cognitive load, the participant would prefer the conversation to start with a little delay.

Meanwhile, the results for (b) showed that participants felt the timing of the robot's main topic was too late in One-7.5 more than in One-3.0. This tendency was observed across all levels of the detective behavior factor. Regardless of cognitive load, a long gap between the call and the main utterance seemed to make people feel that the utterance came too late.

Furthermore, even though the main-utterance timing of One-7.5 and Two-7.5 was the same, participants felt the timing was more natural (item (c)) in Two-7.5 than in One-7.5. This outcome demonstrates the effectiveness of our proposed approach.

6. General discussion

6.1. Implication

Our proposed cooperative behavior pattern using multiple robots is not a perfect solution for starting a conversation naturally, but it is an approach with a strong effect on alleviating unnatural impressions in cases where false-negative recognition of an addressee's attention often occurs.

In the future, these recognition technologies will continue to improve. However, it is difficult to recognize human status perfectly, as even humans sometimes fail to do so. We believe that our study showed a novel approach for using multi-robot interaction to deal with these difficulties, and the experimental results suggested that the approach is effective. For example, Hong et al. [Citation21] propose an HRI architecture for emotional communication based on multimodal sensing. In their architecture, they show how to combine feature extraction of body language using a Kinect 3D sensor and vocal intonation using a microphone to estimate user emotions. Although this work does not mention initiating a conversation with the user, we expect that our proposed method could be incorporated into such an architecture to autonomously and naturally initiate a conversation with the user. Furthermore, in addition to providing information from a robot, our method may be used in cooperative work situations between humans and robots [Citation22], which have been developing rapidly in recent years. For example, in a situation where a human and a robot are working in the same space, it is not always the case that the human immediately pays attention to the robot's call. In such a situation, another robot can be made to talk to the first robot to create a pause, as shown in this study.

6.2. Cost problems of using multiple robots

Implementing two robots just to initiate a conversation naturally would be difficult to tolerate from a cost-effectiveness standpoint. However, there are various advantages to using two robots. For example, it is known that utilizing two robots can effectively convey information to the user [Citation16], improve the user's sense of dialogue participation [Citation23], conceal dialogue breakdowns caused by speech recognition errors [Citation24], promote dialogue continuity [Citation25], increase the effectiveness of praise [Citation26] and apologies from the robots [Citation27], and provide a more enjoyable experience for users [Citation28]. In addition to these advantages, this study newly shows that a conversation can be initiated naturally using two robots. Considering the above wide variety of advantages, there is a good possibility that multiple robots will be introduced into commercial, nursing, and educational facilities. Therefore, we believe that the disadvantages of using two robots will have an increasingly minor impact in future research.

6.3. Use of large language models

Large language models (LLMs) could be used to generate the conversations between the two robots, thereby reducing the effort of creating conversational sentences. In our experiments, we had the robots utter pre-created sentences to keep the conditions almost the same, except for the number of robots and the timing of the main topic. This method is problematic in terms of sentence-production effort when considering long-term use in practical situations. In a situation where the user interacts with the robot every day, if the robot utters the same sentence every time, the user will become bored, so a variety of sentences must be prepared. In general, the effort to create these sentences is significant. Furthermore, creating conversational sentences for each environment is labor-intensive. For example, in a facility for the elderly, conversational sentences may be created to attract the attention of elderly people who have difficulty responding to calls. However, it is difficult to reuse the same sentences in a different environment, such as a children's rehabilitation facility or a commercial facility, so sentences appropriate for each environment must be created. In response to these problems, LLMs can generate conversational sentences appropriate for the situation, based on the location and purpose of the robots and an instruction such as, ‘Please generate short conversation sentences of about two turns between robots.’ Since the generated sentences will differ each time, even in the same situation, the possibility of user boredom is greatly reduced. Thus, by utilizing LLMs, the proposed method becomes easier to use for longer-term interactions and in a variety of situations.
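A minimal sketch of assembling such an LLM prompt from the deployment context is shown below. The function name and the surrounding template wording are hypothetical; only the quoted instruction sentence comes from the text, and the resulting string would be passed to whichever chat-style LLM API is available.

```python
def build_interjection_prompt(location, purpose):
    """Assemble an LLM prompt for a short robot-robot exchange tailored to
    the deployment context. The template wording here is hypothetical; only
    the final quoted instruction sentence is taken from the paper."""
    return (
        f"Two social robots are deployed at a {location} to {purpose}. "
        "Robot A has just called out to a visitor and is waiting before "
        "delivering its main topic. "
        "Please generate short conversation sentences of about two turns "
        "between robots."
    )

prompt = build_interjection_prompt("shopping mall", "provide store information")
# Sending this prompt anew for each encounter yields varied interjections,
# which is what reduces user boredom in long-term use.
```

Swapping the location and purpose arguments (e.g. an elderly-care facility versus a commercial facility) adapts the generated interjections to each environment without hand-writing new sentences.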

6.4. Experimenter effect

It is also important to consider the bias generated by experiencing the One-3.0 level (or One-7.5 level) before the Two-7.5 level, or vice versa. In a within-participant experiment, the experimenter's intentions may leak through to the participants as they experience multiple conditions (the experimenter effect). For example, participants might infer that the experimenter wants them to evaluate the two robots favorably. Although it is difficult to examine this effect directly, we believe that participants were not subject to the experimenter effect, given the nature of the measurement items used in the evaluation. In Experiment 1, participants were asked about the timing of the main utterances, and in Experiment 2, participants were asked whether the timing of the main utterances was too early, too late, or natural. If the experimenter effect had been strong, one would expect the Two-7.5 level to be rated higher on all levels of the stimulus timing factor, but in fact the One-3.0 and One-7.5 levels were rated higher than the Two-7.5 level for appropriate stimulus timing. In other words, participants likely rated the timing as they actually felt it: good, early, late, or natural, without any concern for the experimenter. Therefore, we believe that our proposed method was evaluated without bias from the experimenter effect.

6.5. Limitation

This experiment assumed that the participant always reacted to the robot's main topic. In a real environment, there are possible situations where people are not interested in or not aware of the robot. The experiment did not consider the effectiveness of our approach for such people or conditions.

We fixed the length of the human–robot interaction in this experiment because we intended to investigate the effectiveness of a basic situation. To apply our approach in a real environment, we need to use a recognition technique for the addressee's attention. This investigation will be the topic of future research.

This study did not collect information about the participants' age, ethnicity, department, and so on. Therefore, we cannot report any information other than the participants' gender and the fact that they were students at the authors' university. While this omission limits the detail we can provide, it is unlikely that a few years' difference in age, or a difference in ethnicity, would affect the perceived quality of timing. Therefore, we believe that the lack of demographic information does not detract from the value of this study.

Furthermore, we need to investigate whether a one-robot monologue works as well as a two-robot dialogue. As we considered in Section 3, it may be possible to fill the interval between the call and the main utterance with a single robot's monologue. Future work should consider which variations are more likely to leave room for interpretation, and how the effects of monologues and dialogues differ.

7. Conclusion

This study proposed an approach for alleviating people's negative impressions of a robot starting a conversation. In the approach, after a robot speaks to attract a person's attention, that robot and another robot display a short exchange between themselves until the first robot speaks on the main topic. Through an experiment to investigate the effectiveness of our approach, we found that it enabled the start of a conversation by making the addressee ready to listen without causing an unnatural impression. The results suggested the effectiveness of our proposed approach using multi-robot cooperative behaviors, and it will contribute to HRI as a new approach to starting a conversation naturally.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was funded by JSPS KAKENHI [grant numbers 22H03895 and 19H05691].

Notes on contributors

Takamasa Iio

Takamasa Iio received a PhD degree from Doshisha University, Kyoto, Japan, in 2012. He has since worked at the Intelligent Robotics and Communication Laboratories, ATR, Osaka University, and the University of Tsukuba. Currently, he is an associate professor at Doshisha University, Kyoto, Japan. His field of expertise is social robotics. He is interested in how people's cognition and behavior change through interaction with social robots and how human society changes as a result.

Yuichiro Yoshikawa

Yuichiro Yoshikawa received the PhD degree in engineering from Osaka University, Japan, in 2005. Since 2010, he has been an Associate Professor in the Graduate School of Engineering Science, Osaka University. Since 2014, he has been a project coordinator of the JST ERATO Ishiguro Symbiotic Human–Robot Interaction Project. His research interests include interactive robotics, therapeutic robots for individuals with developmental disorders, and cognitive developmental robotics.

Hiroshi Ishiguro

Hiroshi Ishiguro received a DEng in systems engineering from Osaka University, Japan in 1991. He is currently Professor of the Department of Systems Innovation in the Graduate School of Engineering Science at Osaka University (2009–) and Distinguished Professor of Osaka University (2017–). He is also visiting Director (2014–; group leader: 2002–2013) of the Hiroshi Ishiguro Laboratories at the Advanced Telecommunications Research Institute and an ATR fellow. His research interests include sensor networks, interactive robotics, and android science. He received the Osaka Cultural Award in 2011. In 2015, he received the Prize for Science and Technology (Research Category) from the Minister of Education, Culture, Sports, Science and Technology (MEXT).

References

  • Burgard W, Cremers AB, Fox D, et al. Experiences with an interactive museum tour-guide robot. Artif Intell. 1999;114(1–2):3–55. doi: 10.1016/S0004-3702(99)00070-3
  • Thrun S, Bennewitz M, Burgard W, et al. Minerva: a second-generation museum tour-guide robot. In: Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No. 99CH36288C); Vol. 3. IEEE; 1999.
  • Siegwart R, Arras KO, Bouabdallah S, et al. Robox at expo.02: a large-scale installation of personal robots. Rob Auton Syst. 2003;42(3–4):203–222. doi: 10.1016/S0921-8890(02)00376-7
  • Gockley R, Bruce A, Forlizzi J, et al. Designing robots for long-term social interaction. In: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems; IEEE; 2005. p. 1338–1343.
  • Gross HM, Boehme HJ, Schröter C, et al. Shopbot: progress in developing an interactive mobile shopping assistant for everyday use. In: 2008 IEEE International Conference on Systems, Man and Cybernetics; IEEE; 2008. p. 3471–3478.
  • Shiomi M, Sakamoto D, Kanda T, et al. A semi-autonomous communication robot: a field trial at a train station. In: Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, Amsterdam The Netherlands; 2008. p. 303–310.
  • Kanda T, Shiomi M, Miyashita Z, et al. A communication robot in a shopping mall. IEEE Trans Robot. 2010;26(5):897–913. doi: 10.1109/TRO.2010.2062550
  • Goodwin C. Restarts, pauses, and the achievement of a state of mutual gaze at turn-beginning. Sociol Inq. 1980;50(3–4):272–302. doi: 10.1111/soin.1980.50.issue-3-4
  • Sidner CL, Kidd CD, Lee C, et al. Where to look: a study of human–robot engagement. In: Proceedings of the 9th International Conference on Intelligent User Interfaces, Funchal, Madeira, Portugal; 2004. p. 78–84.
  • Nakano YI, Ishii R. Estimating user's engagement from eye-gaze behaviors in human–agent conversations. In: Proceedings of the 15th International Conference On Intelligent User Interfaces, Hong Kong, China; 2010. p. 139–148.
  • Yamazaki K, Kawashima M, Kuno Y, et al. Prior-to-request and request behaviors within elderly day care: implications for developing service robots for use in multiparty settings. In: ECSCW 2007: Proceedings of the 10th European Conference on Computer-Supported Cooperative Work, Limerick, Ireland, 24–28 Sept 2007; Springer; 2007. p. 61–78.
  • Bergstrom N, Kanda T, Miyashita T, et al. Modeling of natural human–robot encounters. In: 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems; IEEE; 2008. p. 2623–2629.
  • Kompatsiari K, Ciardo F, De Tommaso D, et al. Measuring engagement elicited by eye contact in human–robot interaction. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE; 2019. p. 6979–6985.
  • Kuzuoka H, Suzuki Y, Yamashita J, et al. Reconfiguring spatial formation arrangement by robot body orientation. In: 2010 5th ACM/IEEE International Conference on Human–Robot Interaction (HRI); IEEE; 2010. p. 285–292.
  • Shi C, Shimada M, Kanda T, et al. Spatial formation model for initiating conversation. In: Proceedings of Robotics: Science and Systems VII, Los Angeles, California, USA; 2011. p. 305–313.
  • Sakamoto D, Hayashi K, Kanda T, et al. Humanoid robots as a broadcasting communication medium in open public spaces. Int J Soc Robot. 2009;1:157–169. doi: 10.1007/s12369-009-0015-5
  • Shiomi M, Hagita N. Do synchronized multiple robots exert peer pressure? In: Proceedings of the Fourth International Conference on Human Agent Interaction, Singapore; 2016. p. 27–33.
  • Iio T, Yoshikawa Y, Ishiguro H. Pre-scheduled turn-taking between robots to make conversation coherent. In: Proceedings of the Fourth International Conference on Human Agent Interaction, Singapore; 2016. p. 19–25.
  • Karatas N, Yoshikawa S, De Silva PRS, et al. Namida: multiparty conversation based driving agents in futuristic vehicle. In: Human–Computer Interaction: Users and Contexts: 17th International Conference, HCI International 2015, Los Angeles, CA, USA, Aug 2–7, 2015, Proceedings, Part III 17; Springer; 2015. p. 198–207.
  • Sacks H, Schegloff EA, Jefferson G. A simplest systematics for the organization of turn taking for conversation. In: Studies in the Organization of Conversational Interaction. Cambridge, MA: Elsevier; 1978. p. 7–55.
  • Hong A, Lunscher N, Hu T, et al. A multimodal emotional human–robot interaction architecture for social robots engaged in bidirectional communication. IEEE Trans Cybern. 2020;51(12):5954–5968. doi: 10.1109/TCYB.2020.2974688
  • Ikumapayi OM, Afolalu SA, Ogedengbe TS, et al. Human–robot co-working improvement via revolutionary automation and robotic technologies–an overview. Procedia Comput Sci. 2023;217:1345–1353. doi: 10.1016/j.procs.2022.12.332
  • Iio T, Yoshikawa Y, Ishiguro H. Retaining human–robots conversation: comparing single robot to multiple robots in a real event. J Adv Comput Intell Intell Inform. 2017;21(4):675–685. doi: 10.20965/jaciii.2017.p0675
  • Iio T, Yoshikawa Y, Ishiguro H. Double-meaning agreements by two robots to conceal incoherent agreements to user's opinions. Adv Robot. 2021;35(19):1145–1155. doi: 10.1080/01691864.2021.1974939
  • Iio T, Yoshikawa Y, Chiba M, et al. Twin-robot dialogue system with robustness against speech recognition failure in human–robot dialogue with elderly people. Appl Sci. 2020;10(4):1522. doi: 10.3390/app10041522
  • Shiomi M, Okumura S, Kimoto M, et al. Two is better than one: social rewards from two agents enhance offline improvements in motor skills more than single agent. PloS One. 2020;15(11):e0240622. doi: 10.1371/journal.pone.0240622
  • Okada Y, Kimoto M, Iio T, et al. Two is better than one: apologies from two robots are preferred. Plos One. 2023;18(2):e0281604. doi: 10.1371/journal.pone.0281604
  • Velentza AM, Heinke D, Wyatt J. Human interaction and improving knowledge through collaborative tour guide robots. In: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN); IEEE; 2019. p. 1–7.