As Artificial Intelligence becomes more widespread and pervasive, the transition to a data-driven age poses a conundrum for many: Will AI replace me at my job? Can it become smarter than humans? Who is making the important decisions, and who is accountable?
AI is becoming more and more complex, and tools like ChatGPT, Siri, and Alexa are already part of everyday life, to the point where even experts struggle to grasp and explain how they work in a tangible way. How can we expect the average human to trust such a system? Trust matters not only in decision-making processes but also for societies to function successfully. Ask yourself this question: Who would you trust with a big personal or financial decision?
Today’s banking counseling sessions are associated with various challenges: Besides preparation and follow-up, the consultant is also busy with many different tasks during the conversation. The cognitive load is high, and tasks are either done on paper or with a personal computer, which is why the consultant can’t engage sufficiently with the client. Clients are mostly novices who are not familiar with the subject matter. The consequent state of passivity or uncertainty often stems from a phenomenon known as information asymmetry, which occurs when the consultant has more or better information than the client.
In this article, we propose a new approach based on co-creation and collaboration in advisory services, one that enables the consultant to focus on the customers’ needs by leveraging the assistance of a digital agent. We explore the opportunities and limitations of integrating a digital agent into an advisory meeting in order to allow all parties to engage actively in the conversation.
Rethinking Human-Machine Environments In Advisory Services
Starting from the counseling session described above, we tackled the issues of information asymmetry, trust building, and cognitive overload within the context of a research project.
Understanding the linguistic landscape of Switzerland with its various Swiss-German dialects, the digital agent “Mo” supports consultants and clients in banking consultations by taking over time-consuming tasks, providing support during the consultation, and extracting information. By means of an interactive table, the consultation becomes a multimodal environment in which the agent acts as a third interaction partner.
The setup enables a collaborative exchange between interlocutors, as information is equally visible and accessible to all parties (shared information). Content can be placed anywhere on the table through natural, haptic interactions. Whether the agent records information in the background, actively participates in the composition of a stock portfolio, or warns against risky transactions, Mo “sits” at the table throughout the entire consultation.
To promote active participation from all parties during the counseling session, we have pinpointed crucial elements that facilitate collaboration in a multi-party setting:
Shared Device
All information is made equally visible and interactable for all parties.
Collaborative Digital Agent
By using human modes of communication, social cues, and the support of local dialects, the agent becomes accessible and accepted.
Comprehensible User Interfaces
Multimodal communication helps to convey information in social interactions. Through the use of different output channels, we can convey information at different levels of complexity.
Speech Patterns for Voice User Interfaces
Direct orders to an AI appear unnatural in a multi-party setting. The use of different speech and turn-taking patterns allows the agent to integrate naturally into the conversation.
In the next sections, we will take a closer look at how collaborative experiences can be designed based on those key factors.
“Hello Mo”: Designing Collaborative Voice User Interfaces
Imagine yourself sitting at the table with your bank advisor in a classic banking advisory meeting. The consultant tries to explain to you a ton of banking-specific stuff, all while using a computer or tablet to display stock price developments or to take notes on your desired transactions. In this setup, it is hard for consultants to keep up a decent conversation while retrieving and entering data into a system. This is where voice-based interactions save the day.
When using voice as an input method during a conversation, users do not have to change context (e.g., take out a tablet, or operate a screen with a mouse or keyboard) in order to enter or retrieve data. This helps the consultant to perform a task more efficiently while being able to foster a personal relationship with the client. However, the true strength of voice interactions lies in their ability to handle complex information entry. For example, purchasing stocks requires an input of multiple parameters, such as the title or the number of shares. Whereas in a GUI all of these input variables have to be tediously entered by hand, one by one, a VUI offers the option of entering everything in a single sentence.
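To make the idea of entering several parameters in one sentence more concrete, here is a minimal sketch in Python. It is not how Mo works internally: the sentence shape ("buy/sell, a number, shares of, a title"), the `Order` type, and the `parse_order` function are assumptions made purely for illustration, and a production system would rely on a trained intent and slot model rather than a regular expression.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentence shape: "<buy|sell> <number> share(s) of <title>".
# This pattern is an illustrative assumption, not the project's grammar.
ORDER_PATTERN = re.compile(
    r"\b(?P<action>buy|sell)\s+(?P<quantity>\d+)\s+shares?\s+of\s+"
    r"(?P<title>.+?)\s*[.!?]?$",
    re.IGNORECASE,
)


@dataclass
class Order:
    action: str
    quantity: int
    title: str


def parse_order(utterance: str) -> Optional[Order]:
    """Extract action, quantity, and title from a single spoken sentence."""
    match = ORDER_PATTERN.search(utterance.strip())
    if match is None:
        return None  # no complete order found in this sentence
    return Order(
        action=match.group("action").lower(),
        quantity=int(match.group("quantity")),
        title=match.group("title"),
    )


print(parse_order("We then buy 20 shares of Smashing Media Stocks."))
```

All three parameters arrive in one utterance, whereas a form-based GUI would typically ask for each of them in a separate field.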
Nonetheless, VUIs are still uncharted territory for many users and are accordingly viewed with a huge amount of skepticism. Thus, it is important to consider how we can create voice interactions that are accessible and intuitive. To achieve this goal, it is essential to grasp the fundamental principles of voice interaction, such as the following speech patterns.
Command and Control
This pattern is widely used by popular voice assistants such as Siri, Alexa, and Google Assistant. As the name implies, the assistants are addressed with a direct command — often preceded by a signal “wake word.” For example,
“Hey, Google” → Command: “Turn on the Bedroom Light”
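As a rough illustration of this pattern, the following Python sketch only reacts to utterances that start with a wake word and treats the rest of the sentence as the command. The wake word "Hello Mo" and the function name `extract_command` are assumptions for this example; real assistants detect the wake word on the audio signal, not on transcribed text.

```python
import re
from typing import Optional

# Hypothetical wake word for this sketch; matched on transcribed text only.
WAKE_WORD_PATTERN = re.compile(
    r"^\s*hello,?\s+mo\b[\s,.!:]*(?P<command>.*)$",
    re.IGNORECASE,
)


def extract_command(utterance: str) -> Optional[str]:
    """Return the command following the wake word, or None if the agent
    was not addressed (or was addressed without any command)."""
    match = WAKE_WORD_PATTERN.match(utterance)
    if match is None:
        return None
    command = match.group("command").strip()
    return command or None


print(extract_command("Hello Mo, buy 20 shares of Smashing Media Stocks"))
print(extract_command("We could buy some shares next week"))
```

Everything that is not explicitly addressed to the agent is ignored, which is what makes this pattern predictable but also interrupts the flow of a conversation between people.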
Conversational
The Conversational Pattern, in which the agent understands intents directly from the context of the conversation, is less common in production systems. Nevertheless, we can find examples in science fiction, such as HAL (2001: A Space Odyssey) or J.A.R.V.I.S. (Iron Man 3). The agent can directly extract intent from natural speech without the need for a direct command to be uttered. In addition, the agent may speak up on its own initiative.
As the Command and Control approach is widely used in voice applications, users are more familiar with this pattern. However, utilizing the Conversational Pattern can be advantageous, as it enables users to interact with the agent effortlessly, eliminating the requirement for them to be familiar with predefined commands or keywords, which they may formulate incorrectly.
In our case of a multi-party setting, users perceived the Conversational Pattern in the context of transaction detection as surprising and unpredictable. For the most part, this is due to the limitations of the intent recognition system. For example, during portfolio customization, stock titles are discussed actively. Not every utterance of a stock title corresponds to a transaction, as the consultant and client are debating possibilities before execution. It is difficult, if not nearly impossible, for the agent to distinguish between an option under discussion and an actual intent. In this case, command structures offer more reliability and control at the expense of the naturalness of the conversation, since the Command and Control Pattern results in unnatural interruptions and pauses in the conversation flow. To get the best of both worlds (natural interactions and predictable behavior), we introduce a new speech pattern, Conversational Confirmation.
Typically, transaction intents are formulated according to the following structure:
Interlocutor 1: We then buy 20 shares of Smashing Media Stocks (intent).
Interlocutor 2: Yes, let’s do that (confirmation).
Interlocutor 1: All right then, let’s buy Smashing Media Stocks (reconfirmation).
In the current implementation of the Conversational Pattern, the transaction would be executed after the first utterance, which was often perceived as irritating. In the Conversational Confirmation Pattern, the system waits for both parties to confirm and executes the transaction only after the third utterance. By adhering to the natural rules of human conversation, this approach meets the users’ expectations.
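The three-step structure can be sketched as a small state tracker. The Python code below is an illustrative assumption, not the project's implementation: it presumes that an upstream component has already labeled each utterance as intent, confirmation, or reconfirmation (the `Act` enum), and it adds the assumption that the confirmation must come from the other party while the reconfirmation comes from the proposer, as in the dialogue above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Act(Enum):
    INTENT = auto()          # "We then buy 20 shares of ..."
    CONFIRMATION = auto()    # "Yes, let's do that."
    RECONFIRMATION = auto()  # "All right then, let's buy ..."
    OTHER = auto()           # anything else in the conversation


@dataclass
class Utterance:
    speaker: str
    act: Act  # assumed to be provided by an upstream classifier
    transaction: Optional[str] = None


class ConfirmationTracker:
    """Releases a transaction only after intent, confirmation, reconfirmation."""

    def __init__(self) -> None:
        self._reset()

    def _reset(self) -> None:
        self.pending: Optional[str] = None
        self.proposer: Optional[str] = None
        self.confirmed = False

    def observe(self, utterance: Utterance) -> Optional[str]:
        """Return the transaction to execute, or None if it is too early."""
        if utterance.act is Act.INTENT and utterance.transaction:
            # A new proposal replaces any earlier, unconfirmed one.
            self.pending = utterance.transaction
            self.proposer = utterance.speaker
            self.confirmed = False
        elif utterance.act is Act.CONFIRMATION and self.pending:
            # Assumption: only the other party can confirm a proposal.
            if utterance.speaker != self.proposer:
                self.confirmed = True
        elif utterance.act is Act.RECONFIRMATION and self.confirmed:
            # Assumption: the proposer closes the loop.
            if utterance.speaker == self.proposer:
                transaction = self.pending
                self._reset()
                return transaction
        return None


tracker = ConfirmationTracker()
dialogue = [
    Utterance("consultant", Act.INTENT, "buy 20 Smashing Media Stocks"),
    Utterance("client", Act.CONFIRMATION),
    Utterance("consultant", Act.RECONFIRMATION),
]
for utterance in dialogue:
    print(utterance.act.name, "->", tracker.observe(utterance))
```

Only the third utterance releases the transaction. A real system would also need to decide when a pending proposal expires, which this sketch leaves out.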
Regarding the users’ mental model of digital agents, the Command and Control Pattern provides users with more control and security.
The Command and Control Pattern is suitable as a fallback in case the agent does not understand an intent.
The Conversational Pattern is suitable when information has to be obtained passively from the conversation. (logging)
For collaborative counseling sessions, the Conversational Confirmation Pattern could greatly enhance the counseling experience and lead to a more natural conversation in a multi-party setting.
Sharing Is Caring: The Concept Of The Shared Device
In a world where personal devices such as PCs, mobile phones, and tablets are prevalent, we have grown accustomed to interacting with technical devices in “single-player mode.” The use of private devices undoubtedly has its advantages in certain situations (as in not having to share the million cute cats we google during work with our boss). But when it comes to collaborative tasks — sharing is caring.
Put yourself back into the previously described scenario. At some point, the consultant is trying to show stock price trends on the computer or tablet screen. However, regardless of how the screen is positioned, at least one of the participants has a limited view of it. Because the computer is a personal device of the consultant, the client is excluded from actively engaging with it, which leads to the problem of unequal distribution of information.
By integrating an interactive tabletop projection into the consultation meeting, we aimed to overcome the limitations of “personal devices,” improving trust, transparency, and decision empowerment. It is essential to understand that human communication relies on various channels, i.e., modalities (voice, sight, body language, and so on), which help individuals to express and comprehend complex information more effectively. The interactive table as an output system facilitates this aspect of human communication in the digital-physical realm. In a shared device, we use the physical space as an interaction modality. The content can be intuitively moved and placed in the interaction space using haptic elements and is no longer bound to a screen. These haptic tokens are equally accessible to all users, encouraging especially novice users to interact and collaborate on a regular tabletop surface.
The interactive tabletop projection also makes information more comprehensible for users. For example, during the consultation, the agent updates the portfolio visualization in real time. The impact of a transaction on the overall portfolio can be directly grasped and pulled closer by the client and advisor and used as a basis for discussion.
The result is a transparent approach to information, which increases the understanding of bank-specific and system-specific processes, consequently improving trust in the advisory service and leading to more interaction between customer and advisor.
Apart from the spatial modality, the proposed mixed reality system provides other input and output channels, each with its unique characteristics and strengths. If you are interested in this topic, this article on Smashing provides a great comparison of VUIs and GUIs and when to use which.
The proposed mixed reality system fosters collaboration since:
Information is equally accessible to all parties (reducing information asymmetry, fostering shared understanding, and building trust).
One user interface can be operated collectively by several interaction partners (engagement).
Multisensory human communication can be transferred to the digital space (ease of use).
Information can be better comprehended due to multimodal output (ease of use).
Next Stop: Collaborative AI (Or How To Make A Robot Likable)
For consultation services, we need an intelligent agent to reduce the consultant’s cognitive load. Can we design an agent that is trustworthy, even likable, and accepted as a third collaboration partner?
Empathy For Machines
Whether it’s machines or humans, empathy is crucial for interactions, and social cues are the salt and pepper to achieve this. Social cues are verbal or nonverbal signals that guide conversations and other social interactions by influencing our perceptions of and reactions toward others. Examples of social cues include eye contact, facial expressions, tone of voice, and body language. These impressions are important communicative tools because they provide social and contextual information and facilitate social understanding. In order for the agent to appear approachable, likable, and trustworthy, we have attempted to incorporate social elements while designing the agent. As described above, social cues in human communication are transported through different channels. Transferring them to the digital context once again requires the use of multimodality.
The visual manifestation of the agent enables the elaboration of character-defining elements, such as facial expressions and body language in digital space, analogous to the human body. It can also highlight important context information, such as the system status.
In terms of voice interactions, social cues play an important role in system feedback. For example, a common human communication practice is to confirm an action by stating a short “mhm” or “ok.” Applying this practice to the agent’s behavior, we tried to create a more transparent and natural feeling VUI.
When designing voice interactions, it’s important to note that how the agent is perceived is heavily influenced by the speech pattern utilized. Once the agent is addressed with a direct command, it is assigned a subordinate role (servant) and is no longer perceived as an equal interaction partner. When the agent recognizes the intent of the conversation independently, it is perceived as more intelligent and trustworthy.
Mo: Ambassador Of System Transparency
Despite great progress in Swiss German speech recognition, transaction misrecognition still occurs. While dealing with an imperfect system, we have tried to take advantage of it by leveraging the agent to make system-specific processes more understandable and transparent. We applied a well-known usability heuristic: the more comprehensible system-specific processes are, the better the understanding of a system and the more likely users feel empowered to interact with it (and the more they trust and accept the agent).
A core activity of every banking consultation meeting is the portfolio elaboration phase, where the consultant, client, and agent try to find the best investment solutions. In the process of adjusting the portfolio, transactions get added and removed with the helping hand of the agent. If “Mo” is not fully confident of a transaction, “Mo” checks in and asks whether the recognized transaction has been understood correctly.
The agent’s voice output follows the usual conventions of a conversation: as soon as an interlocutor is unsure regarding the content of a conversation, he or she speaks up, politely apologizes, and asks if the understood content corresponds to the intent of the conversation. In case the transaction was misunderstood, the system offers the possibility to correct the error by adjusting the transaction using touch and a scrolling token (Microsoft Dial). We deliberately chose these alternative input methods over repeating the intent with voice input to avoid repetitive errors and minimize frustration. By giving the user the opportunity to take action and be in control of an actual error situation, the overall acceptance of the system and the agent is strengthened, creating fertile ground for collaboration.
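The check-in behavior described above can be approximated with a simple confidence threshold. The following Python sketch is an assumption for illustration only: the threshold value, the `RecognizedTransaction` type, and the wording of the responses are hypothetical, and the article does not state how Mo's confidence is actually computed.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical threshold; a real value would have to be tuned with users.
CONFIDENCE_THRESHOLD = 0.8


class Action(Enum):
    ADD = auto()       # add the transaction and acknowledge briefly
    CHECK_IN = auto()  # apologize and ask whether it was understood correctly


@dataclass
class RecognizedTransaction:
    description: str
    confidence: float  # 0.0 to 1.0, assumed to come from the recognizer


@dataclass
class Response:
    action: Action
    message: str


def respond(transaction: RecognizedTransaction) -> Response:
    """Acknowledge a confident recognition or check in with the users."""
    if transaction.confidence >= CONFIDENCE_THRESHOLD:
        return Response(Action.ADD, f"Ok, I added {transaction.description}.")
    return Response(
        Action.CHECK_IN,
        "Sorry, I am not sure I understood that correctly. "
        f"Did you mean {transaction.description}? "
        "If not, you can adjust it on the table.",
    )


print(respond(RecognizedTransaction("buy 20 Smashing Media Stocks", 0.55)).message)
```

Note that the low-confidence response points to a correction on the table rather than asking the user to repeat the sentence, mirroring the decision to avoid repeated voice errors.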
Social cues provide opportunities to design the agent to be more approachable, likable, and trustworthy. They are an important tool for transporting context information and enabling system feedback.
Making the agent part of explaining system processes helps improve the overall acceptance and trust in both the agent and the system (Explainable AI).
Towards The Future
Irrespective of the specific consulting field, whether it’s legal, healthcare, insurance, or banking, two key factors significantly impact the quality of counseling. The first factor involves the advisor’s ability to devote undivided attention to the client, ensuring their needs are fully addressed. The second factor pertains to structuring the counseling session in a manner that facilitates equal access to information for all participants, presenting it in a way that even inexperienced individuals can understand. By enhancing customer experience through promoting self-determined and well-informed decision-making, businesses can boost customer retention and foster loyalty.
Introducing a shared device in counseling sessions offers the potential to address the problem of information asymmetry and promote collaboration and a shared understanding among participants. Does this mean that every consultation session depends on the proposed mixed reality setup? For physical consultations, the interactive tabletop projection (or an equivalent interaction space where all participants have equal access to information) does enable a democratic approach to information — personal devices just won’t do the job.
In the context of digital (remote) consultations, collaboration and transparency remain crucial, but the interaction space undergoes significant changes, thereby altering the requirements. Regardless of the specific interaction space, careful consideration must be given to conveying information in an understandable manner. Utilizing different modalities can enhance the comprehensibility of user interfaces, even in traditional mobile or desktop UIs.
To alleviate the cognitive load on consultants, we require a system capable of managing time-consuming tasks in the background. However, it is important to acknowledge that digital agents and voice interactions remain unfamiliar territory for many users, and there are instances where voice processing falls short of users’ high expectations. Nevertheless, speech processing will certainly see great improvements in the next few years, and we need to start thinking today about what tomorrow’s interactions with voice assistants might look like.