Author: DW Innovation
The most excited we've seen the teenage brother of a consortium member this year was when the makers of Baldur's Gate, a famous role-playing game (RPG), made the following announcement: henceforth, players can change their avatars during the game. With the Patch 3 update, the gaming studio explained, it is now possible to customize characters even after exiting the respective editor. That's a novel thing in the world of avatar-based games, a feature enabled by progress in the field of generative AI. This technology now greatly simplifies the process of building virtual environments and their characters, thus enhancing the range of creative choices.
Avatars, AI, and the metaverse
So what is an avatar? Simply put, it's a virtual (maybe slightly fictitious) representation of yourself, capable of performing actions (e.g. engaging in conversations) in a digital space. It can also function as a bot modelled on a human counterpart. Or as an entirely virtual entity/agent without any blueprint in the physical world. Depending on the use case, generative AI can help design and change an avatar's appearance, voice, movements, and even its thoughts and dialogue in what is now often referred to as the metaverse. The concept of the avatar is actually decades old (think: Ultima, Second Life, World of Warcraft etc.), but it's only now – in the 2020s – that avatars are becoming an integral part of all kinds of digital services and entertainment. How did we get to this point?
3D, XR, the pandemic, and "the way of the future"
A very brief explanation could go like this: Advancements in computer graphics and streaming had already enabled a shift from 2D to 3D avatars and agents and catapulted video calls into the mainstream in the late 2010s. And then, in 2020, the pandemic hit. People were isolated and quickly realized that social interaction is a fundamental need. So a lot of companies started pumping a lot of money into extended reality (XR) tech and collaborative online spaces. In this context, popular gaming platforms like Roblox also invested heavily in the expansion of avatar tech (and there didn't seem to be any problems funding all this).
Last year, an AI avatar of Elvis Presley thrilled the audience of "America's Got Talent". Created by a singer and a technologist working together, it gave a convincing performance of the song "Hound Dog". In London's West End, the ABBA show "Voyage", which features all four members of the Swedish band in avatar form, has passed the one-million-spectator mark. ABBA's Björn Ulvaeus says that avatars are "the way of the future". And Tom Hanks is convinced that he will keep acting beyond the grave – in the form of an AI-driven avatar.
Metaverse platform Ready Player Me (named after the 2011 VR sci-fi bestseller Ready Player One) recently launched its generative AI avatar creator. And similar tools are popping up everywhere. Another one is Magic Avatars, a service that creates realistic 3D avatars from a single photo. There's also Synthesia, one of the most advanced AI avatar video makers on the market right now, with more than 120 languages and accents available. The NBA created new AI tech which allows fans to have their very own avatar replace a regular player in a game – and Microsoft recently introduced an avatar generator for MS Teams. These avatars are the harbingers of Microsoft Mesh, the Microsoft metaverse that's not available yet. With their avatars, users are supposed to take part in XR meetings and work collaboratively without having to turn on their cameras. Microsoft advertises that employees can appear in whatever form they choose and feel most comfortable with, which is intended to promote diversity and inclusion.
Inclusive avatars
Speaking of which, avatars are also used as sign language interpreters now. Israeli startup CODA aims to make video watching a lot easier for the deaf and hearing-impaired community. To this end, the company enlists AI-driven avatars able to translate spoken language into sign language almost instantaneously. That's next level because up until recently, similar companies like signer.ai (India) or Signapse (UK) only offered to translate written text into sign language.
What's next?
In a nutshell, there are all kinds of (more or less) sophisticated avatars for all kinds of purposes now, but one big challenge remains: creating an avatar / 3D character / virtual agent (call it what you like) that is portable, sustainable, versatile – and can be used in multiple environments. A solution like this would be very appealing to users facing a complex, rapidly growing digital/immersive space. Hopefully, SERMAS can make a positive contribution here.
In the meantime, let's all enjoy playing a little with 3D authoring tools, generative AI and immersive RPG worlds.
Authors: Valeria Villani, Beatrice Capelli, Lorenzo Sabattini, UNIMORE, 2023.
The growing spread of robots for service and industrial purposes calls for versatile, intuitive and portable interaction approaches. In particular, in industrial environments, operators should be able to interact with robots in a fast, effective, and possibly effortless manner. To this end, reality enhancement techniques have been used to achieve efficient management and simplify interactions, in particular in manufacturing and logistics processes.
Building upon this, in this paper we propose a system based on mixed reality that allows a ubiquitous interface for heterogeneous robotic systems in dynamic scenarios, where users are involved in different tasks and need to interact with different robots. By means of mixed reality, users can interact with a robot through manipulation of its virtual replica, which is always colocated with the user and is extracted when interaction is needed.
The system has been tested in a simulated intralogistics setting, where different robots are present and require sporadic intervention by human operators, who are involved in other tasks. In our setting we consider the presence of drones and automated guided vehicles (AGVs) with different levels of autonomy, calling for different kinds of user intervention.
The proposed approach has been validated in virtual reality, considering quantitative and qualitative assessment of performance and user feedback.
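For readers who want a feel for what "interacting through a virtual replica" could look like in code, here is a rough, hypothetical sketch (our own illustration, not the authors' implementation; all class and function names are assumptions): the user drags a co-located replica of a robot in mixed reality, and the resulting pose is only sent to the physical robot once the interaction is confirmed.

```python
# Hypothetical sketch of replica-based interaction (not the paper's code):
# the user manipulates a virtual replica, and its pose becomes the robot's goal.
from dataclasses import dataclass, field


@dataclass
class Pose:
    x: float = 0.0
    y: float = 0.0
    theta: float = 0.0  # heading in radians


@dataclass
class VirtualReplica:
    """Co-located stand-in for a real robot that the user can grab and move in MR."""
    robot_id: str
    pose: Pose = field(default_factory=Pose)

    def drag_to(self, x: float, y: float, theta: float) -> None:
        # Called by the MR interface while the user manipulates the hologram.
        self.pose = Pose(x, y, theta)


class RobotProxy:
    """Placeholder for the channel to the physical robot (e.g. an AGV or a drone)."""

    def __init__(self, robot_id: str) -> None:
        self.robot_id = robot_id

    def send_goal(self, pose: Pose) -> None:
        # In a real system this would publish a navigation goal to the robot.
        print(f"[{self.robot_id}] new goal: x={pose.x:.2f}, y={pose.y:.2f}, theta={pose.theta:.2f}")


def confirm_interaction(replica: VirtualReplica, robot: RobotProxy) -> None:
    """Commit the manipulated replica pose to the real robot."""
    robot.send_goal(replica.pose)


if __name__ == "__main__":
    replica = VirtualReplica("agv_1")
    robot = RobotProxy("agv_1")
    replica.drag_to(2.5, 1.0, 0.0)  # user drags the hologram to a new spot
    confirm_interaction(replica, robot)
```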
Link to the publication: https://arxiv.org/abs/2307.05280
Published at the IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2023.
Authors: Mohsen Mesgar, Thy Thy Tran, Goran Glavaš, and Iryna Gurevych. 2023
In task-oriented dialogue (ToD), new intents emerge on a regular basis, with a handful of available utterances at best. This renders effective Few-Shot Intent Classification (FSIC) a central challenge for modular ToD systems. Recent FSIC methods appear to be similar: they use pre-trained language models (PLMs) to encode utterances and predominantly resort to nearest-neighbor-based inference. However, they also differ in major components: they start from different PLMs, use different encoding architectures and utterance similarity functions, and adopt different training regimes.
The coupling of these vital components together with the lack of informative ablations prevents the identification of factors that drive the (reported) FSIC performance. We propose a unified framework to evaluate these components along the following key dimensions: (1) encoding architectures: cross-encoder vs. bi-encoder; (2) similarity function: parameterized (i.e., trainable) vs. non-parameterized; (3) training regimes: episodic meta-learning vs. conventional (i.e., non-episodic) training. Our experimental results on seven FSIC benchmarks reveal three new important findings. First, the unexplored combination of cross-encoder architecture and episodic meta-learning consistently yields the best FSIC performance. Second, episodic training substantially outperforms its non-episodic counterpart.
Finally, we show that splitting episodes into support and query sets has a limited and inconsistent effect on performance. Our findings show the importance of ablations and fair comparisons in FSIC. We publicly release our code and data.
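To make the difference between the two encoding architectures concrete, here is a minimal, untrained sketch (not the released code; the model name, toy utterances, and scoring head are illustrative assumptions) of bi-encoder vs. cross-encoder scoring with nearest-neighbor inference on a toy intent-classification example:

```python
# Minimal illustration of bi-encoder vs. cross-encoder scoring for FSIC.
# This is an untrained sketch; the paper's framework trains these components.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

support = {  # one labelled utterance per intent (the "support set")
    "book_flight": "I need a plane ticket to Berlin",
    "check_weather": "Will it rain tomorrow in Rome?",
}
query = "Can you get me a flight to Madrid?"


def cls_embedding(*texts: str) -> torch.Tensor:
    """Encode one utterance (bi-encoder) or a pair (cross-encoder); return the [CLS] vector."""
    batch = tokenizer(*texts, return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0]


# Bi-encoder: utterances are encoded independently; inference is nearest-neighbor
# under a non-parameterized similarity (here: cosine).
query_vec = cls_embedding(query)
bi_scores = {intent: torch.cosine_similarity(query_vec, cls_embedding(u)).item()
             for intent, u in support.items()}

# Cross-encoder: query and support utterance are encoded jointly, so tokens can
# attend across the pair; a parameterized (trainable) head maps the joint [CLS]
# vector to a match score. Here the head is randomly initialized, purely for shape.
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)
cross_scores = {intent: score_head(cls_embedding(query, u)).item()
                for intent, u in support.items()}

print("bi-encoder prediction:   ", max(bi_scores, key=bi_scores.get))
print("cross-encoder prediction:", max(cross_scores, key=cross_scores.get))
```

In the paper's framework, the similarity function and the cross-encoder head are trained, either episodically (meta-learning over sampled few-shot tasks) or conventionally, which is exactly the set of design choices the study disentangles.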
Read the publication here.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics.
Is there a shift in media organizations regarding the use of immersive tech?
Author: The DW Innovation Team
After roughly ten years of XR in international journalism, it has become pretty hard to compile a comprehensive list of interesting and influential experiments. From New York to Cologne and from Helsinki to Lyons (to name just a few hotspots where colleagues of ours used to work on projects), lots of newsrooms and R&D departments have worked with immersive tech. They took us to food banks in California and to landing sites on Mars. They put us in prison cells and on refugee boats in the Mediterranean. They showed antique artefacts, pro athletes, and the rainforest. They used gaming engines, iOS and Android SDKs, XR-enhanced browsers, cardboard contraptions, tethered headsets, stand-alone headsets, and smartphones. However, no matter how their story was produced and what it focused on, it was exactly that: a story. The better part of the XR decade was apparently dedicated to immersive reporting/journalism.
Even though it's hard to say at this point if we're looking at a real trend, it seems that the last couple of years saw at least a slight shift: away from XR as a storytelling device – and towards XR as an infrastructure technology. What may have started with a couple of immersive, avatar-driven meetings during peak COVID gradually moved to other areas of media operations, first and foremost: media production planning and media training.
Two recent R&D projects in DW's portfolio seem to be proof of this. The first one is XR4DRAMA (2020-2023), which relied heavily on AR, VR, and 3D models to improve situational awareness for reporters and filmmakers. The second one is actually SERMAS, which uses XR tech and avatars to improve and scale up security training for journalists. At the same time, DW is still using XR storytelling tools (e.g. Fader in MediaVerse) and looking into new kinds of immersive reporting (e.g. the Snap Spectacles in a new lab project). However, the major part of the innovation team's time and budget is currently spent on the aforementioned infrastructure projects.
The XR playground has clearly been expanded, and more changes are underway.
Author: The TUDa Team
With groundbreaking dialogue applications such as ChatGPT, one question is being asked more and more often: will dialogue agents make humans redundant? This is a question we need to ask ourselves and one that will certainly become more relevant in the coming years. As such, it is worth pointing out the obvious weaknesses of current large-scale language models, especially when it comes to assessing whether they could make human professionals redundant.
When it comes to the potential replacement of human programmers, the evidence is sobering. Large language models are not yet ready to replace programmers, nor will they be in the near future. The recent release of GitHub Copilot showed that while models can provide good code suggestions, the code can still contain errors that need to be corrected by a human programmer. While progress is being made in code generation, and larger models seem to have better code generation capabilities than smaller models, we are still far from the point where long, complex programs can be generated from scratch. For high-stakes programming scenarios, such as critical infrastructure design, the transition to the use of AI programmers would be even slower.
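As a purely hypothetical illustration (not actual Copilot output), the kind of suggestion that looks plausible at a glance but still needs a human reviewer is often a subtle edge-case bug, for example an off-by-one error:

```python
# Hypothetical example of an AI-suggested helper with a subtle bug,
# plus the human-corrected version. Function names are our own.

def moving_average_suggested(values: list[float], window: int) -> list[float]:
    """AI-style suggestion: average over a sliding window."""
    # Bug: the range stops one step too early, so the final window is dropped.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window)]


def moving_average_fixed(values: list[float], window: int) -> list[float]:
    """Human-corrected version: include the final window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(moving_average_suggested(data, 2))  # [1.5, 2.5]       -> last window missing
    print(moving_average_fixed(data, 2))      # [1.5, 2.5, 3.5]
```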
In the case of writing-intensive fields such as journalism, there may be an opportunity to reduce the human workload by having language models generate templates for articles. However, their feasibility for factual fields will depend on the trustworthiness they gain in the future. For now, these models are known to hallucinate, and that problem is not fully solved; until it is, language models will not be able to replace a human journalist.
At the same time, large language models could greatly boost the productivity of human professionals. An example is the use of language models in creative writing, such as providing suggestions or ideas to a human writer. Creativity often benefits from brainstorming, which a language assistant could help with. But ultimately, creative fields are driven by the natural human desire for artistic expression, so there will always be a human involved, no matter how skilled the models become.
The world of technology has made vast strides in the last decade, with innovative new ideas and advances in fields such as Extended Reality (XR), Virtual Reality (VR), Augmented Reality (AR) and robotics. However, while there are some examples of robots being used in everyday life – such as automatic vacuum cleaners or lawnmowers – they are still not widely used. One of the reasons for this lack of acceptance, we feel, is that the ability to interact in a natural and intuitive way is still missing.
For example, this ability to interact is particularly relevant when it comes to public spaces such as offices, museums, and other institutions, where XR systems and robots must be able to interact naturally with untrained users, which demands a level of trust and familiarity that can only come from careful design and implementation.
Another major obstacle preventing widespread adoption is the issue of trust. Many people are understandably concerned about technology spying on them or stealing their personal data. Thus, it is essential that we build trust in these technologies from the ground up, ensuring that they are designed to be secure and trustworthy.
Finally, the architecture of these systems is also important. Modular development is key to ensuring that these technologies can be customized for different situations, while a shared underlying system can help people feel more comfortable and familiar with them.
In conclusion, while XR, VR, AR and robotics have enormous potential, there are still many obstacles to their widespread use in everyday life. Natural interaction, trust and careful design are all essential if we are to make these technologies more accessible and useful for everyone. At SERMAS, we are focusing on these topics with the primary objective of humanizing these systems and making them more socially acceptable to users, thus demystifying their problematic aspects.
Author: Spindox Labs
Extended Reality (XR) is supported by a multitude of platforms, ranging from mobile devices and desktops to virtual reality (VR) and augmented reality (AR) headsets. However, this diversity of platforms poses a significant challenge for developers. They must navigate multiple development environments, languages, and tools to create XR applications that work seamlessly across all platforms. Fortunately, multiplatform XR development tools have emerged to ease this burden.
Multiplatform XR development tools provide developers with a common framework for building XR applications across multiple platforms. These tools enable developers to write code once and deploy it across various devices, reducing development time and costs while ensuring consistent user experiences across devices. Some of the most popular multiplatform XR development tools include Unity, Unreal Engine, and Vuforia.
Unity is a popular development platform for building 2D, 3D, and XR applications. Unity allows developers to create content for various platforms, including iOS, Android, Windows, and macOS, as well as VR and AR devices. With Unity, developers can write code in C# and deploy it across multiple devices, making it a popular choice for XR development.
Unreal Engine is another widely used multiplatform development tool. Developed by Epic Games, Unreal Engine is a powerful engine for building high-quality, immersive XR experiences. It supports a variety of platforms, including mobile devices, desktops, consoles, and VR and AR devices. Unreal Engine also provides developers with a visual scripting language, making it an accessible option for those who prefer a more visual approach to development.
Vuforia is a popular AR development tool that allows developers to create AR applications for various platforms, including iOS, Android, and HoloLens. Vuforia provides developers with tools for image recognition, tracking, and rendering, enabling them to create AR experiences that integrate seamlessly with the physical world.
While these multiplatform XR development tools have simplified development across various devices, they still suffer from fragmentation. Each platform has its own APIs (Application Programming Interfaces), rendering engines, and hardware requirements, making it challenging to create applications that work seamlessly across all devices.

[Image source: https://www.khronos.org/openxr]
OpenXR, developed by the Khronos Group, a consortium of leading technology companies including Microsoft, Oculus, and Epic Games, among others, is an open, royalty-free standard for XR application development. It aims to simplify XR development by providing a single, unified API that abstracts away the differences between the various XR platforms, so that applications run smoothly across all of them. In practice, this standard enables developers to write code once and deploy it across multiple platforms and devices, reducing development time and costs while ensuring consistent user experiences.
In conclusion, multiplatform XR development tools have revolutionized the XR development landscape by enabling developers to create content for various devices. However, these tools still suffer from fragmentation, making it challenging to build applications that work seamlessly across all platforms. OpenXR addresses this fragmentation by providing a unified standard for XR application development. OpenXR 1.0 was released to the public by the Khronos Group at SIGGRAPH 2019, and as XR continues to evolve, the standard will play a critical role in simplifying development and accelerating the growth of XR applications.