Abstract

Human–robot collaboration (HRC) has become an integral element of many manufacturing and service industries. A fundamental requirement for safe HRC is understanding and predicting human trajectories and intentions, especially when humans and robots operate nearby. Although existing research emphasizes predicting human motions or intentions, a key challenge is predicting both human trajectories and intentions simultaneously. This paper addresses this gap by developing a multi-task learning framework consisting of a bi-long short-term memory (bi-LSTM)-based encoder–decoder architecture that takes motion data from both human and robot trajectories as inputs and performs two main tasks simultaneously: human trajectory prediction and human intention prediction. The first task predicts human trajectories by reconstructing the motion sequences, while the second task tests two main approaches for intention prediction: a supervised learning method, specifically a support vector machine, that predicts human intention based on the latent representation, and an unsupervised learning method, the hidden Markov model, that decodes the latent features for human intention prediction. Four encoder designs are evaluated for feature extraction: interaction-attention, interaction-pooling, interaction-seq2seq, and seq2seq. The framework is validated through a case study of a desktop disassembly task with robots operating at different speeds. The results include evaluations of the different encoder designs, an analysis of the impact of incorporating robot motion into the encoder, and detailed visualizations. The findings show that the proposed framework can accurately predict human trajectories and intentions.

1 Introduction

Human–robot collaboration (HRC) is a rapidly growing area of research and application. It plays an important role in various sectors, including, but not limited to, manufacturing. Understanding and predicting human intentions is a critical aspect of the successful implementation of HRC: it equips robots with the ability to interpret and respond to their human counterparts in a timely manner and fosters practical collaboration. Understanding human activities therefore provides solutions that benefit HRC. Specifically, predicting the next positions of human motion and recognizing the intended action goals can significantly improve HRC safety.

Although research efforts have been made to predict human intentions and trajectories individually, there remains a critical gap in the concurrent prediction of these elements. Practical applications require equipping robots with the ability to accurately interpret human actions through a single learning framework. Multi-task learning (MTL) addresses this need by sharing representations between related tasks and facilitating knowledge transfer [1].

The objective of this paper is to develop a multi-task learning framework and compare several encoder architectures for the simultaneous prediction of human intention and movement in HRC tasks. The proposed MTL framework is illustrated in Fig. 1. The model processes motion data from both human and robot trajectories. These inputs are directed into an encoder module that analyzes the sequences to generate a latent representation of the data. From the latent space, two separate tasks branch out. The first task employs a decoder to predict human trajectories by reconstructing the motion sequences. The second task uses supervised learning, specifically a support vector machine (SVM), to predict human intention based on the latent representation. Moreover, an unsupervised learning method, the hidden Markov model (HMM), is utilized for human intention prediction, which offers a different approach to decoding the latent features. Each task aims to decode a different aspect of the input sequences, one for the movement trajectory and the other for the intended path.

Fig. 1. The proposed framework for prediction of human intent and trajectory through multi-task learning

Several analyses have been conducted based on the proposed framework. We have compared the performance of four different encoder architectures, which represent various feature integration methods such as attention mechanisms, pooling functions, and direct sequence-to-sequence mapping without additional transformation. Moreover, we have conducted experimental studies and collected data by implementing two speed modes in a human–robot collaborative disassembly task: high-speed and low-speed robot movements. This design tests the adaptability and robustness of the proposed models under varying operational conditions. Also, we visualized the latent representations of the models to evaluate their capability to capture multiple intention feature classes. In addition, we examined the inference performance of the models, including the number of parameters and inference time, to weigh model performance against computational cost.

The paper is structured as follows: Sec. 2 reviews relevant literature. Section 3 outlines the methodology used to create the framework. Section 4 presents a case study to evaluate the framework in human–robot collaborative disassembly tasks. Finally, Sec. 5 concludes the paper.

2 Related Work

2.1 Human Intention Prediction Approaches in Human–Robot Collaboration.

In human–robot teamwork, the ability to anticipate each other’s actions greatly improves task coordination among team members. Humans can exchange information through direct methods, such as gestures and words, and indirect methods, such as facial expressions and internal assumptions. Robots need to infer human intentions to collaborate comparably. Researchers have made efforts to develop robots capable of predicting human intentions. To name several examples, Fan et al. [2] identified human intentions by preprocessing body postures and evaluated the approach in an HRC disassembly scenario. Magrini et al. [3] introduced the recognition of the operator’s gestures to understand human intent and to control a robot for collaborative polishing operations.

Achieving a semantic understanding of human intent specific to the task at hand is a high-level requirement for robots. This level of understanding requires advanced algorithms to interpret human activities within the specific work context. Some recent work has shown the capabilities of recurrent neural networks (RNNs) and Transformers. For example, the robot’s motion and the operator’s contact forces were evaluated using a long short-term memory (LSTM) network to infer whether the operator intends to assist or resist the manipulator’s movement toward a target point [4]. The accuracy of this inference ranged from 73.87% to 86.49%. In addition, a deep-learning model consisting of LSTM layers processed virtually generated human motion and force sequences to predict subsequent hand movement intentions with accuracies between 90% and 100% [5]. Also, the Transformer architecture has been used to predict human intention in an HRC scenario [6]. However, these approaches often do not consider incorporating robot features, particularly in HRC scenarios. Without fully considering robotic participation, the accuracy of predicting human intent may suffer.

2.2 Human Trajectory Prediction Approaches in Human–Robot Collaboration.

Predicting human movement is an important aspect of improving the safety of robotic systems. Previous studies have demonstrated the importance of capturing human motion in existing HRC developments. For example, Liu et al. [7] used predictive human motion to calculate the minimum safe distance between the human and the manipulator, thereby addressing motion uncertainty. Katsampiris-Salgado et al. [8] predicted the operator’s trajectory for a slightly longer duration to account for potential delays in the robot’s movement. Based on this information, they calculated a robot trajectory that avoids intersecting with the operator’s future path. However, predicting human motion extends beyond safety considerations and is integral to the development of path planning for robots. Xiao et al. [9] highlighted that analyzing disassembly trajectories is beneficial for determining optimal disassembly paths.

Existing studies describe a wide range of methodologies in human motion learning. To name a few, Yang and Howard [10] used an optimization-based posture model to predict human initial and final task motions. However, human–robot interaction is continuous; thus, complete motion prediction better fits realistic requirements. Abuduweili et al. [11] predicted human trajectories using gated recurrent units, which are known for their ability to gate and retain relevant information over time. Tian et al. [12] implemented an inverse reinforcement learning method to reduce noise in human movement patterns. Zhou et al. [13] integrated an attention mechanism into trajectory estimation, achieving higher accuracy compared to RNN methods. These advancements make it important to design different motion-feature learning methods and to further improve trajectory prediction.

2.3 Multi-Task Learning Applications in Human–Robot Collaboration.

In the previous subsections, we have discussed the importance of identifying human intent and trajectory. A considerable limitation of existing studies is their tendency to concentrate on either intent or trajectory independently rather than predicting them simultaneously. By combining both aspects, robots can comprehend the actions a human is about to take and anticipate the likely paths of human movement. To achieve this, multi-task learning is a practical approach, using a single learning architecture to perform multiple tasks jointly [14].

The concept of MTL has been put into practice in HRC scenarios. To name several studies, Cai et al. [15] developed a multi-task deep-learning framework capable of performing action classification, object classification, and body movement prediction. In their study on human–robot handover interactions, Liu et al. [16] combined convolutional neural networks and LSTM networks to accomplish the prediction of human intent and hand-object detection. Despite these advances, there are still few applications of MTL in HRC settings and remanufacturing. Moreover, existing research focuses on performing different types of tasks rather than exploring the application of various types of learning, such as supervised and unsupervised learning. This represents an underexplored area. Also, in the HRC domain, dynamic robot movement is a frequent occurrence. The implications of these dynamic conditions on the predictive performance of MTL models require detailed examination.

When designing MTL structures, several factors are considered, including the integration of task-specific and generic modules and the allocation of model parameters across multiple tasks. According to Ref. [17], the design of the shared MTL architecture (i.e., the encoder) impacts the information learning process, which subsequently affects the estimates produced for the different tasks. Standley et al. [18] also discussed the benefits of different encoder architectures in MTL for balancing performance. In our research, we developed a shared architecture for learning sequences by proposing various information extraction mechanisms. Given that the MTL architecture consists of two primary tasks, human trajectory prediction and human intention classification, our approach uses recent sequential methods for decoding trajectories and employs conventional SVM and HMM techniques for intention classification within a deep learning architecture.

3 Methodology

This section introduces the general workflow for predicting human intentions and trajectories. The process begins with selecting a bi-LSTM-based encoder–decoder architecture for handling sequential data. Four different encoder designs are evaluated for feature extraction. Also, the methods for intent classification, both supervised and unsupervised, are discussed. Finally, the objective function used for model updating is explained. The following subsections provide detailed descriptions of each component.

3.1 Bi-LSTM Encoder–Decoder Architecture.

Sutskever et al. [19] proposed the encoder–decoder architecture and applied it to neural machine translation, where it was designed to convert sequences from one domain into sequences in another domain. The encoder model encodes the input sequence into a context vector and aims to learn the information in the input. After the encoder, the decoder takes the context vector and generates the output. The encoder and decoder are usually designed as recurrent neural networks, such as LSTMs or gated recurrent units.

The workflow of an encoder–decoder architecture consisting of bi-LSTM networks is shown in Fig. 2. The LSTM cell operates on the current input vector $x_t$, the previous memory cell state $c_{t-1}$, and the previous hidden state $h_{t-1}$. How each LSTM cell operates is explained mathematically as

$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$  (1)

$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$  (2)

$o_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$  (3)

$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_c[h_{t-1}, x_t] + b_c)$  (4)

$h_t = o_t \odot \tanh(c_t)$  (5)

where $f_t$, $i_t$, and $o_t$ are the forget gate, input gate, and output gate, respectively, $\sigma$ is the sigmoid function, $\odot$ represents element-wise vector multiplication, and $w$ and $b$ are linear transformation matrices and biases.
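As a concrete illustration of Eqs. (1)–(5), the following minimal NumPy sketch steps a single LSTM cell once; the stacked weight layout, toy dimensions, and random initialization are assumptions for illustration rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following Eqs. (1)-(5).
    W, U, b stack the forget/input/output/candidate parameters row-wise."""
    z = W @ x_t + U @ h_prev + b                 # linear transforms of input and previous hidden state
    f_t, i_t, o_t, g_t = np.split(z, 4)
    f_t, i_t, o_t = sigmoid(f_t), sigmoid(i_t), sigmoid(o_t)  # forget, input, output gates
    c_t = f_t * c_prev + i_t * np.tanh(g_t)      # element-wise update of the cell state, Eq. (4)
    h_t = o_t * np.tanh(c_t)                     # new hidden state, Eq. (5)
    return h_t, c_t

# Toy dimensions: 9 human-motion features, hidden size 4 (illustrative values).
d_in, d_h = 9, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * d_h, d_in))
U = rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_cell_step(rng.standard_normal(d_in), h, c, W, U, b)
```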
Fig. 2. Architecture of the bi-LSTM network
While the standard LSTM model processes data in a sequence focusing on the current and past information, we select a bi-LSTM model that considers past and future data points. This enhancement is achieved by adding an extra layer to the LSTM network. In a bi-LSTM, there are two key layers: the forward hidden layer $h_t^f$ and the backward hidden layer $h_t^b$ [20]. The forward hidden layer processes the input in the natural order, starting from the first element and moving forward. On the other hand, the backward hidden layer processes the input in reverse order, beginning from the last element and moving backward. The final output $H_t$ is produced by combining the outcomes from the forward and backward hidden layers. The implementation of the bi-LSTM model is based on the following equations:

$h_t^f = \mathrm{LSTM}(x_t, h_{t-1}^f, c_{t-1}^f)$  (6)

$h_t^b = \mathrm{LSTM}(x_t, h_{t+1}^b, c_{t+1}^b)$  (7)

$H_t = [h_t^f; h_t^b]$  (8)
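A bi-LSTM layer of this kind is available off the shelf in deep-learning toolkits. The short sketch below (assuming PyTorch, which the paper does not specify) shows how the forward and backward passes of Eqs. (6)–(8) are computed and concatenated; the input and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

# A bi-LSTM layer: the forward pass reads the sequence in natural order,
# the backward pass in reverse, and H_t concatenates both directions (Eqs. (6)-(8)).
bilstm = nn.LSTM(input_size=9, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(2, 50, 9)         # (batch, time-steps, human-motion features)
H, (h_n, c_n) = bilstm(x)         # H has shape (2, 50, 64): forward and backward states concatenated
```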

3.2 Different Encoder Designs.

In the encoder–decoder architecture, the encoder converts the input sequence into a hidden representation, and then the decoder changes this representation back into an output sequence. A possible issue with this method is the risk of losing information during the process. We have added attention and pooling mechanisms to the encoder to address this issue. Four different designs of the encoder are shown in Fig. 3.

Fig. 3. Different encoder architectures sharing a common decoder model: (a) interaction-attention, (b) interaction-pooling, (c) interaction-seq2seq, and (d) seq2seq
The interaction-attention encoder (shown in Fig. 3(a)) processes the robot and human sequences and directs their outputs to an attention layer. The attention mechanism allows the network to focus on the most relevant information [21]. The output sequences from the bi-LSTM network are denoted by $H = [H_1, \ldots, H_t, \ldots, H_T] \in \mathbb{R}^{N \times T}$, where $N$ represents the dimensionality of the output feature vector at each time-step and $T$ represents the number of time-steps. The attention mechanism processes these sequences to compute alignment scores, indicated by $e_{k,t+1}, \ldots, e_{k,t+m}$. These scores quantify the relevance of each input element to the current output element being considered. Attention weights are then derived using a softmax function over these scores:

$\alpha_{k,t+i} = \dfrac{\exp(e_{k,t+i})}{\sum_{j=1}^{m} \exp(e_{k,t+j})}$  (9)

This formulation enforces that the attention weights sum to 1, permitting the model to allocate focus adaptively across different segments of the input sequence for each output time-step. After the attention weights are calculated, the output vectors are computed as the weighted sum of the hidden states:

$a_k = \sum_{i=1}^{m} \alpha_{k,t+i} H_{t+i}$  (10)
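The sketch below illustrates Eqs. (9) and (10) on bi-LSTM outputs. The dot-product scoring function, the use of the last hidden state as the query, and the tensor shapes are assumptions for illustration, since the paper does not state how the alignment scores $e$ are computed.

```python
import torch
import torch.nn.functional as F

def attention_pool(H, query):
    """Attention over bi-LSTM outputs H (batch, T, N) given a query (batch, N).
    Scores e are dot products here (a modeling assumption); Eq. (9) turns them
    into weights with a softmax, and Eq. (10) forms the weighted context vector."""
    e = torch.einsum("btn,bn->bt", H, query)         # alignment scores
    alpha = F.softmax(e, dim=1)                      # attention weights sum to 1 over time
    context = torch.einsum("bt,btn->bn", alpha, H)   # weighted sum of the hidden states
    return context, alpha

H = torch.randn(2, 50, 64)           # encoder outputs (illustrative shapes)
query = H[:, -1, :]                  # use the last hidden state as the query
context, alpha = attention_pool(H, query)
```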

The interaction-pooling encoder (shown in Fig. 3(b)) processes sequences from both robot and human, directing the outputs toward a layer specifically designed for pooling operations. To focus on the most relevant information without the need for complex attention weight calculations, we employ both average and max pooling strategies [22].

For average pooling, the operation is defined as taking the mean of the specified dimensions of the sequence, aiming to generalize the overall trend or average effect present in the sequence. The equation for average pooling over the temporal dimension is given by

$h_{\mathrm{avg}} = \dfrac{1}{T} \sum_{t=1}^{T} H_t$  (11)

This results in a vector $h_{\mathrm{avg}}$ that provides a summarized representation by averaging the features of all time-steps, thereby condensing the temporal information into a single vector.

On the other hand, max pooling is utilized to capture the most dominant features in the sequence by selecting, for each feature dimension, the maximum value over all time-steps. The operation is formalized through the following equation:

$h_{\mathrm{max}} = \max_{t=1,\ldots,T} H_t$  (12)

Here $h_{\mathrm{max}}$ denotes the vector consisting of the largest feature values identified across all time-steps.
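Both pooling operations of Eqs. (11) and (12) reduce the temporal dimension to a single vector. A minimal sketch (assuming PyTorch tensors of shape batch × time-steps × features, which are illustrative) is:

```python
import torch

H = torch.randn(2, 50, 64)     # bi-LSTM outputs: (batch, time-steps, features)

h_avg = H.mean(dim=1)          # Eq. (11): average over the temporal dimension -> (2, 64)
h_max, _ = H.max(dim=1)        # Eq. (12): per-feature maximum over time-steps -> (2, 64)
```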

The interaction-seq2seq encoder, illustrated in Fig. 3(c), processes both human and robot motion sequences. The output vectors are directly derived from the bi-LSTM without further transformation:

$H = [H_1, \ldots, H_T] = \mathrm{biLSTM}(x^{\mathrm{human}}, x^{\mathrm{robot}})$  (13)

This design indicates a direct mapping from input to output, handling sequence data without additional processing layers or functions.

To establish a baseline model, the seq2seq encoder, shown in Fig. 3(d), exclusively processes human motion sequences. We aim to compare whether including robot motions can enhance the prediction of human motion, given the absence of any specialized information extraction function:

$H = [H_1, \ldots, H_T] = \mathrm{biLSTM}(x^{\mathrm{human}})$  (14)

This formulation also represents a direct mapping of human motion from input to output.
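A minimal sketch of the encoder variants is given below. Because the text does not specify how the human and robot streams are fused before the bi-LSTM, the sketch simply concatenates them feature-wise; setting use_robot=False recovers the human-only seq2seq baseline. All layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class InteractionEncoder(nn.Module):
    """Sketch of the interaction-style encoders: human and robot motion sequences
    are concatenated feature-wise before a shared bi-LSTM (fusion scheme assumed)."""
    def __init__(self, human_dim=9, robot_dim=21, hidden=32, use_robot=True):
        super().__init__()
        self.use_robot = use_robot
        in_dim = human_dim + robot_dim if use_robot else human_dim
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, human_seq, robot_seq=None):
        x = torch.cat([human_seq, robot_seq], dim=-1) if self.use_robot else human_seq
        H, _ = self.bilstm(x)    # Eqs. (13)/(14): outputs taken directly, no extra layer
        return H

enc = InteractionEncoder()
H = enc(torch.randn(2, 50, 9), torch.randn(2, 50, 21))   # (batch, time-steps, 64)
```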

3.3 Supervised and Unsupervised Intention Classification.

We employ SVM for supervised classification due to its performance as a discriminative classifier. Our bi-LSTM-based encoder architecture, which incorporates multiple functional components, is designed to prioritize learning during the encoder stage. Also, because the results must be compared with an unsupervised classification method, we selected an SVM with a radial basis function kernel, as it is a lightweight classifier. The SVM identifies the optimal hyperplane that separates the different classes based on the labeled sequential data within our framework.
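A minimal sketch of this supervised classification stage (assuming scikit-learn, with hypothetical latent features and labels standing in for the encoder outputs) is:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical 64-D latent features from the encoder and intent labels 1-8.
Z_train = np.random.randn(500, 64)
y_train = np.random.randint(1, 9, size=500)

# RBF-kernel SVM trained on the labeled latent representations.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(Z_train, y_train)

pred = svm.predict(np.random.randn(10, 64))   # predicted intent labels for new latent vectors
```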

In addition, we select an HMM model for unsupervised recognition. HMM is selected as an appropriate method for unsupervised sequence clustering. HMMs are stochastic processes that model probabilistic transitions between hidden states over time. Compared to other methods such as Gaussian mixture models or variational autoencoders, which focus on the distribution of random variables, HMMs are particularly helpful in capturing the temporal dynamics present in sequential data and clustering temporal dependencies [23].

An HMM consists of a set of discrete hidden states $S_t \in \{1, 2, \ldots, K\}$ and observation sequences $\{z_1, z_2, \ldots, z_N\}$, $z_i \in \mathbb{R}^d$, where $K$ is the number of hidden states and $d$ is the observation dimension.

In the HMM, there are three sets of parameters: the initial state distribution, the state transition matrix, and the emission matrix. The state transition matrix $A \in \mathbb{R}^{K \times K}$ has entries $A_{i,j}$ representing the probability of switching from state $i$ to state $j$:

$A_{i,j} = P(S_{t+1} = j \mid S_t = i)$  (15)
The emission matrix $B \in \mathbb{R}^{K \times d}$, where $B_{j}$ typically represents the parameters of the observation distribution in state $j$, characterizes how observations are generated from the hidden states:

$B_j(z_t) = P(z_t \mid S_t = j)$  (16)

The expectation-maximization (EM) algorithm is utilized to infer the hidden intention states and refine the model parameters based on the observed motion sequences. In the expectation (E) step, the posterior distributions are calculated using the forward–backward algorithm. In the maximization (M) step, the parameters A and B are updated based on the calculated distributions. After the convergence of the iterative process, the sequence of hidden states predicted by the HMM can be interpreted as the sequence of underlying intentions.
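A minimal sketch of this unsupervised stage (assuming the hmmlearn library and a Gaussian observation model, neither of which is stated in the paper) fits an HMM with eight hidden states by EM and decodes the hidden-state sequence:

```python
import numpy as np
from hmmlearn import hmm

# Hypothetical latent sequences: two trials of 1000 steps each, 64-D observations.
Z = np.random.randn(2000, 64)
lengths = [1000, 1000]

# K = 8 hidden states, one per intent class; A and B are refined by EM (Baum-Welch).
model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=50, random_state=0)
model.fit(Z, lengths)

states = model.predict(Z, lengths)   # decoded hidden-state sequence, read as intentions
```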

3.4 Objective Function.

In this model, mean squared error (MSE) is used as the objective function to compare predicted motion positions with true labels. Cross-entropy is excluded due to the unsupervised nature of intent classification. Therefore, the model only considers trajectory prediction error to optimize the results:

$\mathrm{MSE} = \dfrac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2$  (17)

where $y_t$ and $\hat{y}_t$ denote the actual and predicted values, respectively.
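In a deep-learning framework this objective reduces to a standard mean-squared-error loss on the predicted trajectories. A minimal sketch (assuming PyTorch and illustrative tensor shapes) is:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()              # Eq. (17): mean squared trajectory error

y_true = torch.randn(256, 50, 9)      # true future joint positions (batch, steps, features)
y_pred = torch.randn(256, 50, 9, requires_grad=True)

loss = criterion(y_pred, y_true)      # only the trajectory term drives the parameter update
loss.backward()
```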

4 Experiment and Results

This section presents a case study used for data collection and model evaluation. We designed a human–robot collaboration experiment focused on disassembling a desktop. In addition, we compared the performance of various encoder architectures.

4.1 Experimental Design and Dataset.

We designed a human–robot collaborative disassembly experiment to gather data for testing the proposed framework. In this setup, the human and the robot stand face-to-face to disassemble an end-of-life desktop computer. The human operator is tasked with removing two screws located on the left and right sides of the desktop, while the robot is assigned to pick up a disassembled CD drive near the right screw, as illustrated in Fig. 4.

Fig. 4. Experimental setup and disassembly task assignment

The experiment aimed to investigate the impact of the collaborative robot on the human worker’s decision-making and corresponding movements. We tested two velocities for the robot: fast and slow. For this experiment, the robot’s end-effector velocity was set to 0.5 m/s for the fast speed and 0.08 m/s for the slow speed. These velocities were chosen so that the human either does or does not have enough time to disassemble the right screw first. In each trial, the robot is randomly set to move at either speed to pick up the CD drive, so the human operator has no prior knowledge of which robot speed mode they are encountering and must make decisions based on the robot’s actions. When the robot moved fast, the human did not have sufficient time or space to release the right screw safely. To avoid collisions, the human operator would first remove the left screw and wait for the robot to move away before releasing the right screw. On the other hand, when the robot moved slowly, the human worker felt confident completing the screw disassembly before the robot arrived. In the absence of safety concerns, the human would remove the right-hand screw first, following their right-hand preference.

We used the Vicon motion capture system to track the movement of the human operator’s right arm. To determine the positions of the rotating joints, we placed two markers on each side of the shoulder, elbow, and wrist. The positions of the joints are estimated by averaging the positions of the corresponding markers. Data are recorded at a frequency of 50 Hz. We collected 60 trials for each robot speed setting and divided the data into training and test datasets in a 2:1 ratio. Some variations, such as motion trajectory, speed, and acceleration, occurred across trials for a particular speed setting. Acknowledging that using data from a single human worker may limit the model’s generalizability to different operators, our current experimental setup helps us focus on improving prediction accuracy for a specific human operator using our multi-task learning framework. Datasets for other operators and scenarios can be created readily, as the data collection process is fast and our model does not require much data for training.

We measured the average time for the human worker to remove two screws at different robot speed settings. In the fast-speed scenario, the human operator takes 16.36 s (818 time-steps at 50 Hz) to remove two screws, whereas in the slow robot speed case, it takes 14.32 s (716 time-steps at 50 Hz). In the slower robot speed case, the human operator completes the task more quickly. This is because, at the faster speed setting, the operator finishes removing the left screw before the robot completes its task, which causes a delay to avoid a potential collision.

Training samples consist of 51,222 sequences, and test samples comprise 27,424 sequences. A total of nine features are included for human motion sequences, and an additional 21 features collected from the robotic arm are incorporated for model learning. Input data are normalized to maintain uniformity among all features. Sequences are padded to a fixed length to accommodate batch processing. We utilize the Adam optimizer for its adaptive learning rate capabilities, which facilitate convergence during training. The learning rate is initially set to 0.0001. Data are divided into batches of 256 samples to efficiently utilize computational resources. Four NVIDIA GeForce RTX 3060Ti GPUs are used to run the model. To prevent overfitting and achieve generalizable performance, we set five random seeds and tested the model performance on each random seed.
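A minimal sketch of the data preparation and training setup described above is given below. The 50-step sliding-window construction is an assumption consistent with Sec. 4.3; the feature counts, batch size, and learning rate follow the text, while the windowing helper is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_windows(seq, past=50, future=50):
    """Slide a fixed-length window over one normalized trial to build
    (past-input, future-target) pairs; targets keep the 9 human-motion features."""
    X, Y = [], []
    for s in range(len(seq) - past - future + 1):
        X.append(seq[s:s + past])
        Y.append(seq[s + past:s + past + future, :9])
    return torch.stack(X), torch.stack(Y)

trial = torch.randn(818, 30)      # 9 human + 21 robot features, already normalized (illustrative)
X, Y = make_windows(trial)

loader = DataLoader(TensorDataset(X, Y), batch_size=256, shuffle=True)   # batch size from the text
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)              # optimizer settings from the text
```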

4.2 Results on Different Test Sets.

We train the models using all collected training data under the two speed modes of the manipulator. We then test the models on three different test sets: fast-speed, slow-speed, and overall. Results are introduced and compared based on these test sets. Given the limited data size, we use all motion trials from the training set to obtain pre-trained models. Evaluating these pre-trained models on diverse test sets helps us assess the performance of different feature integration methods.

We select the MSE and the coefficient of determination (R²) as criteria to evaluate the trajectory prediction results. For classification results, we use accuracy as the primary metric. The results, presented as mean and standard deviation values, are summarized in Table 1.
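These evaluation criteria can be computed with standard library routines. A minimal sketch (assuming scikit-learn and synthetic stand-in data) is:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

# Synthetic stand-ins: flattened trajectory targets and intent labels 1-8.
y_true = np.random.randn(1000, 9)
y_pred = y_true + 0.1 * np.random.randn(1000, 9)

mse = mean_squared_error(y_true, y_pred)   # trajectory criterion
r2 = r2_score(y_true, y_pred)              # coefficient of determination

labels = np.random.randint(1, 9, 300)
noisy = np.where(np.random.rand(300) < 0.9, labels, np.random.randint(1, 9, 300))
acc = accuracy_score(labels, noisy)        # classification criterion
```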

Table 1
Model performance across different test sets

Model | Trajectory (MSE) | Trajectory (R²) | Classification-SVM | Classification-HMM

Fast-speed set
Seq2seq | 0.10±0.06 | 0.87±0.11 | 0.79±0.01 | 0.64±0.07
Interaction-seq2seq | 0.07±0.03 | 0.92±0.05 | 0.81±0.02 | 0.70±0.07
Interaction-average pooling | 0.11±0.08 | 0.88±0.11 | 0.83±0.04 | 0.65±0.09
Interaction-max pooling | 0.08±0.04 | 0.91±0.05 | 0.82±0.02 | 0.64±0.03
Interaction-attention | 0.06±0.03 | 0.93±0.04 | 0.81±0.03 | 0.74±0.03

Slow-speed set
Seq2seq | 0.09±0.04 | 0.88±0.06 | 0.85±0.00 | 0.74±0.02
Interaction-seq2seq | 0.09±0.04 | 0.88±0.08 | 0.85±0.01 | 0.65±0.03
Interaction-average pooling | 0.10±0.05 | 0.86±0.08 | 0.86±0.02 | 0.64±0.02
Interaction-max pooling | 0.09±0.04 | 0.88±0.07 | 0.86±0.01 | 0.63±0.04
Interaction-attention | 0.07±0.03 | 0.91±0.05 | 0.85±0.02 | 0.66±0.02

Overall set
Seq2seq | 0.09±0.05 | 0.88±0.08 | 0.77±0.00 | 0.51±0.02
Interaction-seq2seq | 0.08±0.04 | 0.90±0.05 | 0.84±0.01 | 0.53±0.02
Interaction-average pooling | 0.10±0.06 | 0.87±0.09 | 0.85±0.02 | 0.52±0.03
Interaction-max pooling | 0.08±0.04 | 0.90±0.05 | 0.84±0.01 | 0.54±0.01
Interaction-attention | 0.06±0.03 | 0.92±0.04 | 0.82±0.01 | 0.51±0.05

Table 1 emphasizes how variations in encoder design, including the seq2seq, attention, and pooling methods, influence performance. Each method has a different impact on the model’s ability to perform the trajectory prediction and classification tasks under both supervised and unsupervised learning paradigms.

First, we compare the performance of trajectory prediction. The interaction-attention model consistently demonstrates superior performance across the different test sets. This is due to the attention mechanism’s ability to capture important temporal relationships in time-series data. For supervised classification performance, the pooling function outperforms the attention mechanism. Although the attention mechanism is robust for sequence prediction, its advantage may be diminished in sequence classification tasks if the training data size is insufficient. For unsupervised classification, different models exhibit varying levels of prediction accuracy. This variability is expected since the classification task was not included in the loss function, and labels are absent.

Also, we compare the performance between encoder models. With the introduction of multiple feature integration encoders, we conduct detailed model comparisons. Integrating robot motions significantly improved the prediction of human motions when comparing seq2seq and interaction-seq2seq models. This improvement is particularly valuable in human–robot collaboration settings and shows the importance of considering robot motion, even when the robot follows a predefined path. Evaluating various pooling functions reveals that max pooling is better suited to our scenario. Furthermore, when comparing the max pooling model with the attention model, the attention mechanism demonstrated superior performance in trajectory prediction.

4.3 Results of Trajectory Plots.

To visualize the trajectory results, we display the predicted trajectories from the interaction-attention model, which achieved the best prediction performance. This model uses the past 50 time-steps to predict the future 50 time-steps. Predicted human trajectories are projected onto 3D coordinates in meters. Figure 5 illustrates two randomly selected predicted trials, each consisting of 50 time points, with each joint plotted separately.

Fig. 5. Results of the interaction-attention model for visualizing human trajectories

Compared to joints such as the elbow and wrist, the shoulder joint exhibits less motion variation, which increases the difficulty of predicting its trajectories. To examine this, the human shoulder trajectories are visualized separately in Fig. 6. The predicted trajectories from the interaction-attention model closely align with the true trajectories with minimal deviation. On the other hand, the seq2seq model demonstrates the largest discrepancies between predicted and true trajectories, with deviations throughout the movement path.

Fig. 6. Comparisons of various encoders for visualizing human shoulder trajectories

4.4 Results of Classification Plots.

Figure 7 illustrates the classification heatmaps of the interaction-attention model, where the classification accuracy for each intent class is listed. For clarity, detailed descriptions of the intent classes are provided in Table 2.

Fig. 7. Results of the interaction-attention model for human intents prediction heat-maps
Table 2
Motion description for intent labeling

Intent label | Motion description
1 | Move from the standby position to the right screw
2 | Transition from the right screw to the left screw
3 | Move from the left screw to the standby position
4 | Move from the standby position to the left screw
5 | Transition from the left screw to the right screw
6 | Move from the right screw to the standby position
7 | Unfasten the right screw
8 | Unfasten the left screw

The SVM model performs consistently well across individual classes, with some predictions achieving 100% accuracy, such as labels 1 and 6, which represent the initial and final motions during disassembly. The HMM model, however, shows varying accuracy for each intent class. For example, it predicts the beginning motion (label 1) with high accuracy but struggles with end motions (label 3). This shows that while the HMM processes initial motion sequences well, its performance degrades as the motion changes continuously. A major difference between the supervised and unsupervised models is that the SVM succeeds in predicting small movements, such as unfastening screws (labels 7 and 8), whereas the HMM performs better with large-scale movements, such as transitions between positions.

In addition, we evaluate two models, interaction-max pooling model and interaction-average pooling model, for their performance in supervised and unsupervised prediction of human intentions, presented in Fig. 8. These two models were chosen as they outperformed interaction-seq2seq and seq2seq models regarding intention prediction. The results show that SVM consistently outperforms HMM in both pooling strategies, despite some misclassification for labels 5 and 7.

Fig. 8. Comparison of SVM and HMM in intention prediction when using (a) interaction-max pooling and (b) interaction-average pooling

4.5 Results of Latent Representations.

The latent space shows a compressed and abstract representation of the features learned through the encoding process. Specifically, the latent vectors encode temporal and spatial patterns derived from human and robotic motion sequences.

Latent representations help evaluate the model’s ability to uncover the underlying structure of the data during the learning process. Common methods for presenting latent representations include calculating Euclidean distances [24] between latent vectors and visualizing latent space distributions [25]. In our study, we employ a visualization approach to differentiate latent feature distributions according to human intent labels. Features within the same cluster should be closely packed together, while features from different clusters should be well-separated.

To display the latent representations, we apply principal component analysis to reduce the original 64-dimensional latent space to three dimensions. This dimensionality reduction technique enables us to project high-dimensional data into a more interpretable form while preserving as much variance as possible. Moreover, to assess the model’s learning consistency, we visualize the latent space of each proposed model during both the training and testing phases.
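A minimal sketch of this dimensionality-reduction step (assuming scikit-learn and hypothetical 64-dimensional latent vectors) is:

```python
import numpy as np
from sklearn.decomposition import PCA

Z = np.random.randn(5000, 64)                 # hypothetical 64-D latent vectors from the encoder
Z3 = PCA(n_components=3).fit_transform(Z)     # project onto the three leading principal components
# Z3 can then be scatter-plotted in 3D and colored by the eight intent labels.
```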

As shown in Fig. 9, the attention-based latent representation shows well-distributed classes, indicating that the model has captured the distinctive features of the different classes. Moreover, the latent spaces for both training and testing data display similar patterns, suggesting that the attention mechanism generalizes well across data subsets. This consistency between the train and test distributions validates the robustness of the attention-based approach in feature extraction.

Fig. 9. Performance of interaction-attention in latent space

Figure 10 presents the latent space visualizations for the interaction model using the max pooling (Fig. 10(a)) and average pooling (Fig. 10(b)) strategies. The max pooling approach generates more well-separated clusters, in which the latent vectors are spread out in the 3D space. In contrast, average pooling results in more compact clusters.

Fig. 10. Comparisons of pooling strategies in latent space: (a) interaction-max pooling and (b) interaction-average pooling

Figure 11 provides the latent space visualizations for the interaction-seq2seq and seq2seq models. These plots show the benefit of incorporating robot motion sequences into the feature learning process for human intent prediction. The interaction-seq2seq model shows more well-defined class separation than the basic seq2seq model.

Fig. 11. Comparisons of latent space involved in robot motion: (a) interaction-seq2seq and (b) seq2seq

4.6 Results of Human Joint Positions.

As mentioned earlier, Fig. 5 shows trials for each joint individually. These plots help us observe how the model independently predicts trajectories for each joint. However, to evaluate the model’s consistency over extended periods, we concatenated the single-trial predictions from eight individual test trials and displayed them in a single plot. The trajectory plots in Fig. 12 illustrate the movements of each joint in the X, Y, and Z coordinates over eight consecutive test trials. This approach helps us assess the model’s ability to maintain accurate predictions over multiple continuous predictions.

Fig. 12. Comparison of trajectories of joint positions on different speed test sets: (a) trajectories of joint positions on the fast-speed test set and (b) trajectories of joint positions on the slow-speed test set

The predicted trajectories closely follow the true trajectories for all joint positions. The x-axis in the plots represents time-steps, and we have concatenated eight test trials together to provide a snapshot of the model’s performance over extended sequences. The close alignment of the predicted joint positions with the true positions indicates that the interaction-attention model captures the dynamics of human movement over prolonged sequences, regardless of the speed of the action.

4.7 Results of Model Inference Performance.

This section analyzes model inference performance, which assists in understanding the tradeoffs between model complexity and inference speed and helps users select the most appropriate model.

The number of parameters in each encoder module indicates the model’s size and complexity. A higher parameter count typically signifies a model capable of capturing more intricate patterns in the data. For instance, the interaction-seq2seq model in Table 3 has the highest parameter count at 651.858 k.

Table 3
Comparison of model inference performance

Model | Parameters (k) | Inference (ms/batch)
Seq2seq | 384.018 | 0.52±0.007
Interaction-seq2seq | 651.858 | 0.83±0.009
Interaction-average pooling | 417.426 | 0.85±0.010
Interaction-max pooling | 417.426 | 0.85±0.010
Interaction-attention | 532.443 | 0.92±0.090

The inference time, measured in milliseconds per batch, indicates how quickly each model can predict outcomes based on new data. Lower inference times are needed for real-time applications, where speed is essential. The seq2seq model in Table 3 demonstrates the fastest inference time, with a mean of 0.52 ms/batch and a standard deviation of 0.007 ms. On the other hand, while having a considerable number of parameters (532.443 k), the interaction-attention model shows the slowest inference time, with a mean of 0.92 ms/batch. The attention calculation does not require additional parameters but includes computationally intensive operations to calculate the attention weights.
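Parameter counts and per-batch inference times of the kind reported in Table 3 can be measured as in the sketch below (assuming PyTorch; the stand-in encoder and batch shape are illustrative, not the paper's models):

```python
import time
import torch
import torch.nn as nn

model = nn.LSTM(30, 32, batch_first=True, bidirectional=True)   # stand-in encoder for illustration
batch = torch.randn(256, 50, 30)                                # one batch of padded sequences

n_params = sum(p.numel() for p in model.parameters())           # encoder parameter count

with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        model(batch)
    per_batch_ms = (time.perf_counter() - start) / 100 * 1000   # mean inference time per batch

print(f"{n_params / 1000:.3f} k parameters, {per_batch_ms:.2f} ms/batch")
```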

5 Conclusions

In this study, we aim to accurately predict both human intentions and movement paths in HRC settings. This becomes important for safety and productivity when humans and robots share workspaces. We developed a multi-task learning framework that can simultaneously predict a person’s intentions and movement trajectory. Four different encoder architectures are tested within this framework, and we explore both supervised and unsupervised methods for analyzing movement data, with a special focus on capturing the timing of these movements.

We conducted experiments in which humans and robots collaboratively disassemble components to collect data and evaluate the performance of the proposed framework. The results demonstrate that the system performs well in predicting both intentions and movements. We presented latent representations to evaluate how well the models capture important features and provided detailed plots of specific human joint positions under different speed settings. In addition, we compared the inference times of different encoder designs to assess their efficiency.

The scope of this work can be extended in several directions. First, future work can explore optimizing task sequencing by adjusting the robot’s speed to evaluate whether the multi-task learning model can handle these dynamics. Also, the proposed framework can be applied to more complex multi-robot collaboration scenarios than the current one-to-one partnerships. Moreover, future research can develop more advanced architectures with alternative sequential methods in the encoder and decoder, such as few-shot learning, to facilitate performance comparisons and potential improvements in generalizability. Furthermore, future work includes performing multi-modality prediction by integrating image and sequential data to advance the model’s predictive capabilities. Finally, the work can be extended to provide a detailed analysis of model performance for different joints to assess whether the proposed framework and corresponding encoders provide superior predictions for all joint positions or if their performance varies.

Acknowledgment

This material is based upon work supported by the National Science Foundation–USA under Grant Nos. 2026276 and 2422826. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

References

1. Zhang, Y., and Yang, Q., 2021, "A Survey on Multi-Task Learning," IEEE Trans. Knowl. Data Eng., 34(12), pp. 5586–5609.
2. Fan, J., Zheng, P., and Lee, C. K., 2023, "A Vision-Based Human Digital Twin Modeling Approach for Adaptive Human-Robot Collaboration," ASME J. Manuf. Sci. Eng., 145(12), p. 121002.
3. Magrini, E., Ferraguti, F., Ronga, A. J., Pini, F., De Luca, A., and Leali, F., 2020, "Human-Robot Coexistence and Interaction in Open Industrial Cells," Rob. Comput.-Integr. Manuf., 61, p. 101846.
4. Cacace, J., Caccavale, R., Finzi, A., and Grieco, R., 2023, "Combining Human Guidance and Structured Task Execution During Physical Human-Robot Collaboration," J. Intell. Manuf., 34(7), pp. 3053–3067.
5. Yao, B., Yang, B., Xu, W., Ji, Z., Zhou, Z., and Wang, L., 2024, "Virtual Data Generation for Human Intention Prediction Based on Digital Modeling of Human-Robot Collaboration," Rob. Comput.-Integr. Manuf., 87, p. 102714.
6. Zhang, X., Tian, S., Liang, X., Zheng, M., and Behdad, S., 2024, "Early Prediction of Human Intention for Human–Robot Collaboration Using Transformer Network," ASME J. Comput. Inf. Sci. Eng., 24(5), p. 051003.
7. Liu, W., Liang, X., and Zheng, M., 2023, "Task-Constrained Motion Planning Considering Uncertainty-Informed Human Motion Prediction for Human-Robot Collaborative Disassembly," IEEE/ASME Trans. Mechatron., 28(4), pp. 2056–2063.
8. Katsampiris-Salgado, K., Dimitropoulos, N., Gkrizis, C., Michalos, G., and Makris, S., 2024, "Advancing Human-Robot Collaboration: Predicting Operator Trajectories Through AI and Infrared Imaging," J. Manuf. Syst., 74, pp. 980–994.
9. Xiao, J., Gao, J., Anwer, N., and Eynard, B., 2023, "Multi-Agent Reinforcement Learning Method for Disassembly Sequential Task Optimization Based on Human–Robot Collaborative Disassembly in Electric Vehicle Battery Recycling," ASME J. Manuf. Sci. Eng., 145(12), p. 121001.
10. Yang, J., and Howard, B., 2020, "Prediction of Initial and Final Postures for Motion Planning in Human Manual Manipulation Tasks Based on Cognitive Decision Making," ASME J. Comput. Inf. Sci. Eng., 20(1), p. 011007.
11. Abuduweili, A., Li, S., and Liu, C., 2019, "Adaptable Human Intention and Trajectory Prediction for Human-Robot Collaboration," arXiv preprint arXiv:1909.05089.
12. Tian, S., Liang, X., and Zheng, M., 2023, "An Optimization-Based Human Behavior Modeling and Prediction for Human-Robot Collaborative Disassembly," 2023 American Control Conference (ACC), San Diego, CA, May 31–June 2, pp. 3356–3361.
13. Zhou, H., Yang, G., Wang, B., Li, X., Wang, R., Huang, X., Wu, H., and Wang, X. V., 2023, "An Attention-Based Deep Learning Approach for Inertial Motion Recognition and Estimation in Human-Robot Collaboration," J. Manuf. Syst., 67, pp. 97–110.
14. Caruana, R., 1997, "Multitask Learning," Mach. Learn., 28, pp. 41–75.
15. Cai, J., Liang, X., Wibranek, B., and Guo, Y., 2023, "Multi-Task Deep Learning-Based Human Intention Prediction for Human-Robot Collaborative Assembly," Computing in Civil Engineering 2023, Corvallis, OR, June 25–28, pp. 579–587.
16. Liu, C., Li, X., Li, Q., Xue, Y., Liu, H., and Gao, Y., 2021, "Robot Recognizing Humans Intention and Interacting With Humans Based on a Multi-Task Model Combining ST-GCN-LSTM Model and YOLO Model," Neurocomputing, 430, pp. 174–184.
17. Crawshaw, M., 2020, "Multi-Task Learning With Deep Neural Networks: A Survey," arXiv preprint arXiv:2009.09796, https://arxiv.org/abs/2009.09796.
18. Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., and Savarese, S., 2020, "Which Tasks Should Be Learned Together in Multi-Task Learning?," Proceedings of the 37th International Conference on Machine Learning, Vol. 119, H. D. III and A. Singh, eds., Proceedings of Machine Learning Research, PMLR, pp. 9120–9132, https://proceedings.mlr.press/v119/standley20a.html.
19. Sutskever, I., Vinyals, O., and Le, Q. V., 2014, "Sequence to Sequence Learning With Neural Networks," NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 27, Montreal, Quebec, Canada, Dec. 8–13.
20. Yousaf, K., and Nawaz, T., 2022, "A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos," IEEE Access, 10, pp. 16283–16298.
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., 2017, "Attention Is All You Need," Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, Dec. 4–9.
22. Kao, C.-C., Sun, M., Wang, W., and Wang, C., 2020, "A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 4–8, IEEE, pp. 316–320.
23. Ma, H., Zhang, Z., Li, W., and Lu, S., 2021, "Unsupervised Human Activity Representation Learning With Multi-Task Deep Clustering," Association for Computing Machinery, 5(1), pp. 1–25.
24. Fu, Z., Zhao, Y., Chang, D., Wang, Y., and Wen, J., 2023, "Latent Low-Rank Representation With Weighted Distance Penalty for Clustering," IEEE Trans. Cybern., 53(11), pp. 6870–6882.
25. Zhang, X., Yi, D., Behdad, S., and Saxena, S., 2023, "Unsupervised Human Activity Recognition Learning for Disassembly Tasks," IEEE Trans. Ind. Inform., 20(1), pp. 785–794.