ChatHuman: Revolutionizing 3D Human Understanding with RAG
Meet the General Specialist: Where AI Generalists Harness Specialist Tools for Unmatched Precision
Introduction
Humans differ from other animals in that we make and use tools to carry out specific jobs and understand the world better. When solving complex real-world challenges, we rely on a diverse set of tools because problems rarely depend on a single area of expertise. ChatHuman works the same way: it is not a robot but a multimodal Large Language Model that specialises in understanding people and how they move in 3D. ChatHuman learns to use many different tools and models that help it, for example, recognise how people are feeling, estimate how they are moving, and detect where bodies touch objects. It consists of a multimodal LLM and 22 human-related tools.
In recent years, research on 3D humans has progressed rapidly thanks to the creation of many tools that each perform a specific task, such as estimating a person's 3D pose from a single image (CLIFF, HMR, HMR 2.0), predicting face and body shape (SHAPY, DECA), capturing emotions (EMOCA), and identifying regions of touch or contact (TUCH, DECO). Each of these tools is called a "specialist" model because it focuses on a single task. ChatHuman, by contrast, is a "generalist": it can address a broad spectrum of problems by utilising the specialist models. It knows when and how to apply them and can integrate the results from various tools to solve new challenges.
Illustrated Example (Figure 1): Reasoning-Based Pose Estimation with the HMR Tool
1. User Query with Input Image: A user initiates a query, such as: “Please estimate the 3D pose of the person in the image.”
2. LLM Analysis: ChatHuman’s multimodal LLM processes the query and the image to determine what is being asked. Here, the task is 3D pose estimation of the person in the image.
3. Tool Selection: Based on the query and the image, ChatHuman decides that HMR is the suitable tool for this task. HMR is a model that predicts 3D human pose and shape from a single image.
4. Tool Application: ChatHuman passes the input image to the HMR tool, which works as follows:
i. Input Processing: The 2D image is taken as input to the tool.
ii. Pose Estimation: A neural network trained for human pose recovery processes the image and predicts the 3D coordinates of key body joints, such as the shoulders, elbows, hips, and knees.
iii. Mesh Generation: From the 3D coordinates, the tool generates a mesh (3D model) of the human body in order to obtain a detailed estimation of the pose and shape of the person.
5. Tool Output: The HMR tool outputs a 3D mesh of the person, which contains pose and shape parameters.
6. Result Discrimination: ChatHuman verifies the reliability of the tool’s output. If multiple tools are applied, ChatHuman compares their results and selects the best one. (A simplified code sketch of the whole loop follows this list.)
7. Response Generation: ChatHuman combines the output obtained from the chosen tool with its general knowledge and produces a response. This may include:
- A textual description of the estimated pose (e.g., “The person is standing with their left arm raised”).
- A visual representation, such as a 3D model of the estimated pose superimposed on the input image.
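To make the walkthrough concrete, here is a minimal Python sketch of that tool-calling loop. All names (`ToolResult`, `run_hmr`, `answer_query`) are hypothetical placeholders, and the keyword-based tool selection merely stands in for the LLM's reasoning; this is a sketch under those assumptions, not the actual ChatHuman implementation.

```python
# Minimal sketch of the tool-calling loop described above.
# All names here are hypothetical placeholders, not ChatHuman's real code.

from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    name: str          # which tool produced the result
    payload: Any       # raw output, e.g. SMPL pose/shape parameters
    confidence: float  # score used later for result discrimination


# Registry mapping tool names to callables (used in step 3: tool selection).
TOOLS: Dict[str, Callable[[bytes], ToolResult]] = {}


def run_hmr(image: bytes) -> ToolResult:
    """Placeholder for the HMR tool: image -> 3D pose/shape parameters."""
    # A real implementation would run the HMR network here.
    fake_params = {"pose": [0.0] * 72, "shape": [0.0] * 10}
    return ToolResult(name="HMR", payload=fake_params, confidence=0.9)


TOOLS["HMR"] = run_hmr


def answer_query(query: str, image: bytes) -> str:
    # Steps 2-3: pick a tool. A real agent lets the LLM decide, guided by
    # paper-based RAG; a keyword check stands in for that here.
    tool_name = "HMR" if "pose" in query.lower() else "none"
    if tool_name not in TOOLS:
        return "No suitable tool found; answering from general knowledge."

    # Steps 4-5: apply the tool to the image.
    result = TOOLS[tool_name](image)

    # Step 6: discriminate; discard implausible or low-confidence results.
    if result.confidence < 0.5:
        return "Tool output was unreliable; falling back to the LLM alone."

    # Step 7: combine the tool output with language to form the response.
    return f"Estimated 3D pose using {result.name}: {result.payload['pose'][:3]}..."


print(answer_query("Please estimate the 3D pose of the person in the image.", b""))
```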
However, the main challenge is handling these diverse output formats, which include images, text, and 3D parametric meshes. A parametric mesh is a digital model of an object or person made up of points, edges, and surfaces, whose parameters can be adjusted to change the shape, size, or texture of the object.
How does it work?
The model takes its motivation from how humans approach a new tool:
First, people read the related papers and learn the underlying techniques, which tells them when and how to use a given tool. Second, they evaluate the results to check whether the output is reliable, or compare performance across similar tools. Finally, they adopt the best one.
Two key steps are involved:
1. Paper-based Retrieval Augmented Generation (RAG) Tool
Teaching LLMs to accurately discern when and how to use tools presents a significant challenge. Some tools have many usage scenarios and require background knowledge to be applied properly. For the HMR tool, for instance, relevant queries might include “Can you estimate the pose of this person?”, “What parameters need to be set?”, or “I want to get the 3D mesh of this person.” Furthermore, supporting many tools means prompts become longer and more complex, which makes it harder for an LLM to use different models, especially ones it was not trained on.
To overcome these challenges, a paper-based RAG mechanism is introduced, as illustrated in Figure 2. First, the academic paper associated with each tool is fed to GPT-4; the paper provides extensive background and detailed instructions, enabling the generated user queries to cover a wide range of application scenarios. During this process, the LLM accesses different parts of these papers, and the results show that “reading the paper” improves tool-use performance; the analysis also identifies which sections of a paper are most valuable for instructing tool use. Additionally, just as people consult the user guide when they come across a new tool, ChatHuman retrieves tool documentation through the paper-based Retrieval-Augmented Generation (RAG) mechanism to improve the LLM’s understanding and handling of new tools. As a result, even if the LLM has not encountered a tool during fine-tuning, it can still use that tool effectively with the aid of paper-based RAG.
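As a rough illustration of the retrieval idea, the sketch below indexes a few hand-written "paper" snippets and retrieves the one most relevant to a user query with TF-IDF similarity. The real system retrieves passages from the actual tool papers and likely uses a stronger retriever; the toy corpus and the `retrieve_tool_context` helper are illustrative assumptions.

```python
# Simplified sketch of paper-based retrieval for tool instructions.
# ChatHuman's actual retriever and paper corpus are not reproduced here;
# this only illustrates the idea with TF-IDF over a toy corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "paper" chunks describing each tool (stand-ins for real paper text).
paper_chunks = [
    "HMR reconstructs 3D human pose and shape from a single RGB image.",
    "DECO predicts dense vertex-level human-scene contact from an image.",
    "EMOCA captures facial expression and emotion from a face image.",
]
chunk_tools = ["HMR", "DECO", "EMOCA"]

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(paper_chunks)


def retrieve_tool_context(query: str, top_k: int = 1):
    """Return the paper chunk(s) most relevant to the user query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, chunk_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(chunk_tools[i], paper_chunks[i], float(scores[i])) for i in ranked]


# The retrieved chunk would be prepended to the LLM prompt as tool instructions.
print(retrieve_tool_context("Can you estimate the pose of this person?"))
```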
2. Tool Result Discrimination and Integration
Beyond selecting and calling tools, it is equally important to analyse and integrate their outcomes to solve the problem. However, the outputs of different tools come in different forms (e.g., language, images, or vectors such as SMPL poses). To leverage these results and enhance the LLM’s understanding of 3D humans, a tool-conditioned transformation is introduced that converts tool outcomes into either a text or a visual format.
For instance, DECO identifies exact contact points on a person’s body in an image. Using the SMPL model, these points are grouped into specific body parts, and the final text-format result specifies which body parts are in contact with objects. This transformation provides a more detailed and easier-to-understand description of the contact regions detected by DECO in terms of human body parts. Similarly, the mesh created by PoseScript is rendered as an image. These results, combined with the user’s question, guide the agent’s response about 3D humans. When several tools can handle a user’s request, the LLM chooses among them by being presented with multiple options, which helps it select the most suitable outcome for the query.
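A minimal sketch of such a tool-conditioned transformation for contact results is shown below. The vertex-to-body-part mapping is a tiny made-up excerpt (the real SMPL segmentation covers all 6,890 mesh vertices), and `contacts_to_text` is a hypothetical helper, not DECO's or ChatHuman's code.

```python
# Sketch: converting DECO-style per-vertex contact predictions into a
# textual description of contacted body parts via a (hypothetical) SMPL
# vertex-to-part segmentation.

import numpy as np

# Hypothetical mapping from SMPL vertex index ranges to body-part names.
PART_OF_VERTEX = {
    range(0, 100): "left hand",
    range(100, 200): "right hand",
    range(200, 300): "left knee",
}


def contacts_to_text(contact_vertex_ids: np.ndarray) -> str:
    """Group contacted vertices into body parts and describe them in text."""
    parts = set()
    for v in contact_vertex_ids:
        for vertex_range, part in PART_OF_VERTEX.items():
            if int(v) in vertex_range:
                parts.add(part)
    if not parts:
        return "No body part is in contact with the scene."
    return "The " + ", ".join(sorted(parts)) + " are in contact with objects in the scene."


# Example: DECO flags a few vertices on the hands as being in contact.
print(contacts_to_text(np.array([12, 57, 150])))
```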
A user query can take the form of text, images, or other 3D information (where applicable). The multimodal LLM-based agent uses the paper-based RAG mechanism to decide whether to employ tools and to identify the best way to use them. After the tools are applied, their results are transformed into a text or visual format via the tool-conditioned transformation and fed back to the agent, which formulates the response. ChatHuman is the first model to address this problem of heterogeneous formats by fine-tuning an LLM as an agent that calls the appropriate tools in response to user input.
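For the discrimination step, one simple way to present several tool results to the LLM is as a multiple-choice prompt, sketched below. The prompt wording and the `build_discrimination_prompt` helper are illustrative assumptions, not ChatHuman's exact template.

```python
# Sketch: present transformed outputs from several tools to the LLM as
# multiple-choice options so it can pick the result that best fits the query.

def build_discrimination_prompt(query: str, tool_results: dict) -> str:
    # Label each candidate (A), (B), ... with its tool name and summary.
    options = "\n".join(
        f"({chr(65 + i)}) {tool}: {summary}"
        for i, (tool, summary) in enumerate(tool_results.items())
    )
    return (
        f"User query: {query}\n"
        f"Several tools produced the following results:\n{options}\n"
        "Which result best answers the query? Reply with the option letter."
    )


prompt = build_discrimination_prompt(
    "Please estimate the 3D pose of the person in the image.",
    {"HMR": "standing, left arm raised", "CLIFF": "sitting, arms crossed"},
)
print(prompt)
```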
Evaluation of 3D Pose Estimation
Experiments are conducted on human-related tasks by comparing ChatHuman with state-of-the-art (SOTA) task-specific tools. For pose estimation, two SOTA methods, ChatPose and HMR 2.0, were selected; in the reported example, both ChatPose and HMR 2.0 failed to identify the correct pose.
Evaluation of the Paper-based RAG Mechanism
Table 1 shows how effectively the system selects and uses tools to solve tasks. Higher values indicate better performance in understanding the task, choosing the right tool, providing the right inputs, and achieving accurate results.
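As a toy illustration of the kind of metric behind such a table, the snippet below scores tool selection as a simple accuracy over made-up test cases; the paper's evaluation also covers argument correctness and answer quality, which are not reproduced here.

```python
# Toy illustration of a tool-use metric: did the agent pick the expected
# tool for each query? The test cases and predictions below are made up.

test_cases = [
    {"query": "Estimate this person's 3D pose.", "expected": "HMR", "predicted": "HMR"},
    {"query": "Which body parts touch the chair?", "expected": "DECO", "predicted": "DECO"},
    {"query": "What emotion is on her face?", "expected": "EMOCA", "predicted": "DECA"},
]

correct = sum(c["expected"] == c["predicted"] for c in test_cases)
print(f"Tool-selection accuracy: {correct / len(test_cases):.2f}")  # 0.67 here
```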
Evaluation of Integration and Discrimination Tool
1. Unreasonable Tool Output: Here, the model identifies and corrects an output that is not plausible. For example, if a tool provides an inaccurate estimate of a person’s height or weight, ChatHuman adjusts these measurements to more realistic values based on additional context or better understanding (a minimal sketch of such a plausibility check follows this list).
2. Reasonable Tool Output: In this scenario, the model validates and integrates a reasonable output from the tool without making substantial changes. This demonstrates the system’s ability to recognize when the tool’s response is already satisfactory and doesn’t require further modification.
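As a minimal sketch of the "unreasonable output" case, the snippet below flags a shape estimate whose height falls outside a loose range of realistic adult values; the thresholds and field names are illustrative assumptions, not values from the paper.

```python
# Sketch: plausibility check on a shape-estimation tool's output
# (e.g. SHAPY-style height/weight attributes). Thresholds are illustrative.

def is_plausible(height_m: float, weight_kg: float) -> bool:
    """Flag outputs outside a loose range of realistic adult measurements."""
    return 1.2 <= height_m <= 2.3 and 30.0 <= weight_kg <= 250.0


tool_output = {"height_m": 3.4, "weight_kg": 70.0}  # clearly unreasonable height
if not is_plausible(**tool_output):
    # The agent would then re-query another tool or adjust the estimate
    # using additional context, as described above.
    print("Implausible estimate detected; requesting a corrected value.")
```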
Limitations
ChatHuman struggles in particular with vague or incomplete user requests. These limitations are illustrated in Figure 7, which shows how user interactions can help refine and correct the system’s outputs. In short, while ChatHuman can process and respond to user inputs effectively, its performance may suffer if the initial input is unclear or lacks specific details; subsequent clarifications or additional information from the user can significantly improve the accuracy of its responses.
Applications
1. Augmented Reality (AR) and Virtual Reality (VR): ChatHuman can enhance AR and VR experiences by providing more accurate and interactive human models. This could be used in gaming, training simulations, or virtual social interactions, where realistic human avatars are essential.
2. Gaming: The system can be used to develop more lifelike and responsive characters in video games, improving player engagement and the overall gaming experience by integrating more realistic human-like interactions and behaviours.
3. Fashion Industry: In fashion, ChatHuman could assist in designing clothes by providing accurate body shape and size estimations. It can help in customizing designs to fit specific body dimensions or creating virtual try-ons for online shopping platforms.
Conclusion
ChatHuman is an LLM-based model designed to learn and use tools to solve 3D human-related tasks. It understands and analyses user queries and calls the appropriate tools to provide solutions. There are two key features: first, the paper-based RAG mechanism improves performance, especially for tools the model was not trained on; second, tool result discrimination and integration selects the best result from among several tool outputs. In evaluations, ChatHuman not only surpasses previous models in tool-usage accuracy but also improves performance on a variety of 3D human-related tasks. In addition, understanding 3D humans can benefit fields such as AR and VR, gaming, the fashion industry, and entertainment. However, there are also potential downsides, such as privacy issues, body shaming, and the creation of deepfakes, so it is necessary to implement safeguards against these ethical risks.
References
- Lin, J., Feng, Y., Liu, W., & Black, M. J. (2024, May 7). ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning. arXiv.org. https://arxiv.org/abs/2405.04533
- Feng, Y., Feng, H., Black, M. J., & Bolkart, T. (2020, December 7). Learning an Animatable Detailed 3D Face Model from In-The-Wild Images. arXiv.org. https://arxiv.org/abs/2012.04012
- Tripathi, S., Chatterjee, A., Passy, J. C., Yi, H., Tzionas, D., & Black, M. J. (2023, September 26). DECO: Dense Estimation of 3D Human-Scene Contact In The Wild. arXiv.org. https://arxiv.org/abs/2309.15273
- Kanazawa, A., Black, M. J., Jacobs, D. W., & Malik, J. (2017, December 18). End-to-end Recovery of Human Shape and Pose. arXiv.org. https://arxiv.org/abs/1712.06584
- Hong, F., Zhang, M., Pan, L., Cai, Z., Yang, L., & Liu, Z. (2022, May 17). AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars. arXiv.org. https://arxiv.org/abs/2205.08535
- Müller, L., Osman, A. A. A., Tang, S., Huang, C. H. P., & Black, M. J. (2021, April 7). On Self-Contact and Human Pose. arXiv.org. https://arxiv.org/abs/2104.03176
- Choutas, V., Muller, L., Huang, C. H. P., Tang, S., Tzionas, D., & Black, M. J. (2022, June 14). Accurate 3D Body Shape Regression using Metric and Semantic Attributes. arXiv.org. https://arxiv.org/abs/2206.07036
- Li, Z., Liu, J., Zhang, Z., Xu, S., & Yan, Y. (2022, August 1). CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation. arXiv.org. https://arxiv.org/abs/2208.00571
- Danecek, R., Black, M. J., & Bolkart, T. (2022, April 24). EMOCA: Emotion Driven Monocular Face Capture and Animation. arXiv.org. https://arxiv.org/abs/2204.11312
- OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). GPT-4 Technical Report. arXiv.org. https://arxiv.org/abs/2303.08774
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023, April 17). Visual Instruction Tuning. arXiv.org. https://arxiv.org/abs/2304.08485
- Feng, Y., Lin, J., Dwivedi, S. K., Sun, Y., Patel, P., & Black, M. J. (2023, November 30). ChatPose: Chatting about 3D Human Pose. arXiv.org. https://arxiv.org/abs/2311.18836