## Abstract

Reinforcement learning algorithms can autonomously learn to search a design space for high-performance solutions. However, modern engineering often entails the use of computationally intensive simulation, which can slow design timelines when paired with highly iterative approaches such as reinforcement learning. This work provides a reinforcement learning framework that leverages models of varying fidelity to enable an effective solution search while reducing overall computational needs. Specifically, the agent is trained with models of varying fidelity, progressing iteratively from low to high fidelity. To demonstrate the effectiveness of the proposed framework, we apply it to two multimodal multi-objective constrained mixed integer nonlinear design problems involving the components of a ground vehicle and an aerial vehicle. For each problem, we utilize a high-fidelity and a low-fidelity deep neural network surrogate model, trained on performance data generated from underlying ground truth models. A tradeoff between solution quality and the proportion of low-fidelity surrogate model usage is observed. High-quality solutions are achieved with substantial reductions in computational expense, showcasing the effectiveness of the framework for design problems where the use of just a high-fidelity model is infeasible. This tradeoff between solution quality and computational efficiency is contextualized by visualizing the exploration behavior of the design agents.

## 1 Introduction

The discrete and multimodal nature of high-dimensional engineering design problems makes design synthesis challenging. This challenge is potentially met through deep Reinforcement Learning (RL) algorithms, which can autonomously learn to explore the design space based on the nature of the design problem [1–4]. However, in many modern design problems, expensive high-fidelity representations and simulations are necessary to accurately evaluate the performance of a potential solution. This can adversely affect the overall computational efficiency of deep RL algorithms, leading to slower design timelines, and higher economic and environmental risks [5,6]. As lower-fidelity simulations still contain potentially valuable information about the performance of a solution, they can be utilized to reduce the computational expense of exploration. This work explores the tradeoff between solution performance and computational efficiency when using different combinations of high- and low-fidelity models. Specifically, this work proposes an RL framework that utilizes models of varying fidelity [7–12] to search the design space for high-performance solutions.

Engineering analysis models serve the purpose of characterizing the relationship between the design of an engineered system and its performance attributes. The degree to which such a model can reproduce the behavior of a real-world system is referred to as model fidelity [11]. High-fidelity models typically use computationally expensive numerical simulations to accurately capture the underlying relationship of interest. Low-fidelity models are usually a simplification of the high-fidelity model. This can involve utilizing simpler geometric representations or physics models, simulating in a reduced dimensional space, using partially converged results, or preparing a data-fit surrogate model with high-fidelity simulation data [7]. While low-fidelity models are less accurate, they offer the advantage of being computationally cheaper than high-fidelity ones.

The combination of models at varying fidelity levels is common in engineering practice [7,9,10,12,13]. For instance, a first-order approximation is found in the design of buildings for seismic loading, where the dominant vibrational frequency of a building is approximated as the reciprocal of the number of stories in the building [14]. The approximation is typically used to complement high-fidelity seismic damage simulations for the safety of dense urban areas [15]. A variable-fidelity strategy that utilizes low-fidelity models for early-stage design exploration and high-fidelity models for the later stages may be able to balance the tradeoff between computational efficiency and solution quality. Further, the utilization of RL in such a variable-fidelity strategy makes it possible to learn exploration strategies that benefit from the varied feedback received at low-fidelity and high-fidelity levels. This strategy could involve the use of design representations and analysis models of varying fidelity during exploration. The proposed RL framework assumes a fixed design representation and encompasses analysis models of varying fidelity that compose the reward formulation.

The rest of the paper is organized as follows. Section 2 provides a brief introduction to design space exploration and discusses the potential of deep RL as an autonomous design optimizer. In Sec. 3, we propose an RL framework for design wherein the agent trades off between models with different computational costs and levels of fidelity, and we detail the other methodologies used in this work. In Sec. 4, two multimodal multi-objective constrained mixed integer nonlinear design problems are introduced to demonstrate the effectiveness of the proposed framework. Section 5 presents the results of the case studies, including an analysis of exploration behavior and an assessment of the tradeoff between solution quality and computational efficiency. Section 6 summarizes the contribution of the paper and proposes several directions for future work.

## 2 Background

### 2.1 Design Space Exploration.

The design of engineered systems often involves the abstract arrangement of components, the selection of specific components for the arrangement, and the assignment of parameter values to the parameterized components. In some cases, design also entails the synthesis of new components; however, when no new components are being synthesized, the design problem reduces to a configuration design problem [16,17]. When the arrangement of components is fixed, the design task reduces to a skeletal design problem [16], which is the focus of this work. It involves selection from sets of all the types of components (e.g., battery choice, controller choice) and assignment of values to the discrete and continuous parameters associated with each component (e.g., physical parameters governing component size or cyber parameters in the controller cost function). The design space of the system is composed of all combinations of the design variables, including both component choices and the associated component parameters. These design variables can be used to compute multiple objectives involving system performance, cost, and other relevant attributes associated with different disciplinary domains [18].

Decision-making involved in the design of an engineered system is sequential in nature [19]. Specifically, it involves searching the design space of the system to determine which combination of variables yields optimal designs. This is referred to as design space exploration [18]. However, when the design space is enormous, it may be infeasible to achieve designs that meet optimality criteria. Accordingly, algorithms attempt to search the design space in an optimally directed fashion to reach designs that satisfice [20,21]. Because design problems are often multimodal and involve discrete design variables, the use of gradient-based optimization algorithms is limited. Instead, gradient-free optimization algorithms are preferred for exploring the design space. For instance, Stoecklein et al. [22] employed an evolutionary algorithm for a highly multimodal and discrete design problem involving micropillar sequences for fluid flow sculpting. However, research on optimization algorithms shows that no single algorithm dominates the others in performance [23]. Moreover, that work found that each algorithm can provide the best solution for at least some problems. Accordingly, the designer will need to iteratively implement different algorithms to find the most suitable one. For instance, Saldanha et al. [24] demonstrate a methodology for choosing the best evolutionary algorithm for a heat exchanger design problem from a finite set of algorithm alternatives.

A methodology that can autonomously choose or learn an algorithm could therefore benefit design space exploration. Li and Malik [1] have demonstrated that algorithms designed by an RL agent can outperform existing algorithms in terms of solution quality and computational efficiency. To this end, we utilize RL for autonomously exploring the design space of engineered systems.

### 2.2 Reinforcement Learning-Driven Design.

RL algorithms [25] can iteratively learn effective strategies for the sequential decision-making task of exploring the design space [3,4]. Moreover, they can leverage exploration data in future iterations more efficiently than other design algorithms. For instance, Lee et al. [2] have identified deep RL approaches to be more data efficient than evolutionary optimization approaches for a multimodal and discrete fluid flow sculpting design problem. To emphasize, on the one hand, the genetic algorithm-based design approach uses a widely applicable heuristic at each iteration of the exploration. On the other hand, an RL agent learns to explore by creating a mapping from the design space to an action space that maximizes the long-term collection of rewards across several iterations. By learning strategies specific to the characteristics of the design space, it attempts to maximize the solution quality for the design problem. Further, RL-based design approaches possess generalization and transfer capabilities [3,26,27]. Lastly, when compared to other machine learning-based design approaches, an RL approach can accommodate non-differentiable objective formulations and is not limited by the data input by the designer [28].

While RL algorithms can learn to optimize efficiently from the agent's experience, many engineered systems demand the use of expensive high-fidelity representations and simulations (like computational fluid dynamics or finite element analysis) for evaluating the objective functions and constraints that compose the agent reward. First, this can adversely affect the computational efficiency of the RL algorithm, leading to slower design timelines. In recent years, there has been increasing interest in reducing design timelines from years to months [29]. Second, the high computational and energy expense incurred in deep learning implementations is becoming economically and environmentally unsustainable [5,6]. Martínez-Plumed et al. [30] have identified that insufficient effort has been put toward dimensions like computational and data efficiency in the race to achieve performance benchmarks. For instance, an open reimplementation of the RL-based AlphaZero was trained using 2000 NVIDIA V100 graphics processing units (GPUs) with 87 years of GPU time [31]. These aspects limit the applicability of standard deep RL algorithms for engineering design. It is therefore important to develop frameworks that improve their computational efficiency. The sequential design process typically involves the sequencing of representations and analysis models of varying fidelity to reduce computational demands [8]. For instance, Mehmani et al. [9] and Wang et al. [10] utilize models of varying fidelity by progressively transitioning to higher fidelity levels to find solutions to design problems efficiently. Accordingly, we hypothesize that an RL approach that progressively utilizes models from low to high fidelity [7] could achieve the desired solution quality at a reasonable computational cost.

## 3 Methodology

This work proposes an RL framework for design space exploration using models of varying fidelity. This section outlines the specific methodology used to construct the framework. In Sec. 3.1, we formalize the skeletal design problem as a multi-objective constrained mixed integer problem. Further, we build upon this to prepare a mathematical formulation of the problem involving models of varying fidelity. Based on this formulation, Sec. 3.2 proposes the RL framework and details the agent–design space interaction. Section 3.3 outlines the methodology for training neural networks to serve as tunable surrogate models of varying fidelity. Section 3.4 describes the parametric study for training RL agents with different proportions of a low- and high-fidelity model.

### 3.1 Skeletal Design Problem Formulation.

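The skeletal design problem involves minimizing a set of objectives ($\mathbf{f}(\mathbf{x}, \mathbf{y})$) defined by several continuous ($\mathbf{x}$) and discrete ($\mathbf{y}$) design variables such that a set of inequality ($\mathbf{g}(\mathbf{x}, \mathbf{y})$) and equality ($\mathbf{h}(\mathbf{x}, \mathbf{y})$) constraints are satisfied. For a system involving $p$ objectives, $m$ continuous variables, $n$ discrete variables, $r$ inequality constraints, and $s$ equality constraints, this is mathematically defined in negative null form according to the traditional optimization paradigm as follows:

$$
\begin{aligned}
\min_{\mathbf{x},\,\mathbf{y}} \quad & \mathbf{f}(\mathbf{x}, \mathbf{y}) = [f_1(\mathbf{x}, \mathbf{y}), \ldots, f_p(\mathbf{x}, \mathbf{y})]^{T} \\
\text{subject to} \quad & \mathbf{g}(\mathbf{x}, \mathbf{y}) = [g_1(\mathbf{x}, \mathbf{y}), \ldots, g_r(\mathbf{x}, \mathbf{y})]^{T} \le \mathbf{0} \\
& \mathbf{h}(\mathbf{x}, \mathbf{y}) = [h_1(\mathbf{x}, \mathbf{y}), \ldots, h_s(\mathbf{x}, \mathbf{y})]^{T} = \mathbf{0} \\
& \mathbf{x} \in \mathbb{R}^{m}, \quad \mathbf{y} \in \mathbb{Z}^{n}
\end{aligned}
$$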

To account for models of varying fidelity, the objectives are arranged in a matrix, $\mathbf{F}$, of size $p \times q_f$, referred to as the objective fidelity matrix, where $p$ is the number of objectives and $q_f$ is the maximum number of fidelity levels at which any objective is defined. We intentionally make few assumptions about the form of the engineering analysis models from which these objectives are evaluated. For instance, some objective terms may not be computable for every model. As a convention, the objectives are ordered from the lowest to the highest fidelity level. Further, for objectives with fewer than $q_f$ fidelity levels, the remaining terms are kept 0. The matrix is defined as follows:

$$
\mathbf{F} = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1 q_f} \\ f_{21} & f_{22} & \cdots & f_{2 q_f} \\ \vdots & \vdots & \ddots & \vdots \\ f_{p1} & f_{p2} & \cdots & f_{p q_f} \end{bmatrix}
$$

where $f_{ij}$ is the $i$th objective defined at the $j$th fidelity level.

Similarly, the inequality constraints are arranged in a matrix, $\mathbf{G}$, of size $r \times q_g$, referred to as the inequality constraint fidelity matrix, where $r$ is the number of inequality constraints and $q_g$ is the maximum number of fidelity levels at which any inequality constraint is defined. These constraints may or may not be associated with the same engineering analysis models as in the objective fidelity matrix. For instance, a problem may have an objective evaluated using a fluid flow model, while the constraint could be evaluated using a structural model. Further, the ordering of constraints and the absent fidelity levels for a specific constraint follow the same treatment as the objective fidelity matrix. The matrix is defined as follows:

$$
\mathbf{G} = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1 q_g} \\ g_{21} & g_{22} & \cdots & g_{2 q_g} \\ \vdots & \vdots & \ddots & \vdots \\ g_{r1} & g_{r2} & \cdots & g_{r q_g} \end{bmatrix}
$$

where $g_{ij}$ is the $i$th inequality constraint defined at the $j$th fidelity level.

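Likewise, the equality constraints are arranged in a matrix, $\mathbf{H}$, of size $s \times q_h$, referred to as the equality constraint fidelity matrix, where $s$ is the number of equality constraints and $q_h$ is the maximum number of fidelity levels at which any equality constraint is defined. The matrix is defined as follows:

$$
\mathbf{H} = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1 q_h} \\ h_{21} & h_{22} & \cdots & h_{2 q_h} \\ \vdots & \vdots & \ddots & \vdots \\ h_{s1} & h_{s2} & \cdots & h_{s q_h} \end{bmatrix}
$$

where $h_{ij}$ is the $i$th equality constraint defined at the $j$th fidelity level.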

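The weights assigned to the objectives at each fidelity level are collected in an objective weighting matrix, $W_f$, of size $p \times q_f$. The objective weighting matrix is defined as follows:

$$
W_f = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1 q_f} \\ w_{21} & w_{22} & \cdots & w_{2 q_f} \\ \vdots & \vdots & \ddots & \vdots \\ w_{p1} & w_{p2} & \cdots & w_{p q_f} \end{bmatrix}
$$

where $w_{ij}$ is the weight toward the $i$th objective defined at the $j$th fidelity level.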

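The weighted objective terms are summed into a scalar objective, $f'$, as defined below:

$$
f' = \sum_{i=1}^{p} \sum_{j=1}^{q_f} w_{ij} f_{ij}
$$

Similarly, we define constraint weighting matrices, $W_g$ and $W_h$, of size $r \times q_g$ and $s \times q_h$, respectively. Accordingly, the weighted constraint fidelity matrices, $\mathbf{G}'$ and $\mathbf{H}'$, are defined as follows:

$$
\mathbf{G}' = W_g \circ \mathbf{G}, \qquad \mathbf{H}' = W_h \circ \mathbf{H}
$$

where $\circ$ denotes the element-wise product. Unlike the scalar objective, these do not involve a summation of the terms, which makes it possible to uniquely penalize the agent for every constraint it violates, as detailed in the reward formulation in Sec. 3.2.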
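As a minimal Python sketch of the weighting step, assume that the scalar objective is the sum of all element-wise weighted objective terms and that the weighted constraint matrices are element-wise products. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def hadamard(weights: Matrix, values: Matrix) -> Matrix:
    """Element-wise product of two equally shaped matrices."""
    same_rows = len(weights) == len(values)
    if not same_rows or any(len(w) != len(v) for w, v in zip(weights, values)):
        raise ValueError("weights and values must have the same shape")
    return [[w * v for w, v in zip(w_row, v_row)]
            for w_row, v_row in zip(weights, values)]


def scalar_objective(w_f: Matrix, f: Matrix) -> float:
    """Assumed form of f': sum of all weighted objective terms."""
    return sum(sum(row) for row in hadamard(w_f, f))


# Hypothetical example: p = 2 objectives, q_f = 2 fidelity levels (low, high).
F = [[4.0, 5.0],
     [1.0, 2.0]]
# Sparse weights: objective 1 uses only its low-fidelity term,
# objective 2 uses only its high-fidelity term.
W_f = [[1.0, 0.0],
       [0.0, 0.5]]
f_prime = scalar_objective(W_f, F)  # 1.0 * 4.0 + 0.5 * 2.0

# Hypothetical example: r = 1 inequality constraint at two fidelity levels.
G = [[0.2, 0.3]]
W_g = [[0.0, 1.0]]
G_prime = hadamard(W_g, G)  # each entry stays separate, no summation
```

With sparse weights such as these, only the models attached to nonzero weights would need to be evaluated in a given portion of the search.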

While it is customary in optimization algorithms to reduce a multi-objective problem to a single-objective problem by a weighted sum, our approach also provides the flexibility to choose different fidelity levels for different objectives and constraints in different portions of the search. This reduces the computational expense when some of the objectives and constraints utilize only a low-fidelity model in some portions of the search. Specifically, this is achieved by using sparse weights across the fidelity levels of a particular objective or constraint.

### 3.2 Reinforcement Learning Framework.

A reinforcement learning-based design agent can solve the skeletal design problem by starting with a seed design and iteratively tuning the continuous and discrete variables to minimize the objective, $f'$, while satisfying the constraints $\mathbf{G}'$ and $\mathbf{H}'$. Accordingly, the agent–design space interaction when the agent transitions from state $t$ to state $t + 1$ during training is illustrated in Fig. 1 and discussed hereinafter.

The agent learns to take actions ($a$) in an agent state ($s$) based on the feedback received in the form of scalar rewards ($R$). Specifically, the agent needs to learn a policy, i.e., a mapping from the state space to the action space, to maximize the sum of rewards it sees over time. At iteration $t$, the agent state ($s_t$) is composed of the design state ($d_t$) and the elements of the weighting matrices in the iteration as specified by the weighting schedules ($(W_f)_t$, $(W_g)_t$, $(W_h)_t$). The design state ($d_t$) is defined by the design variables ($\mathbf{x}_t$, $\mathbf{y}_t$) that define the system. Accordingly, the agent state is defined as follows:

$$
s_t = \left[\, d_t,\ (W_f)_t,\ (W_g)_t,\ (W_h)_t \,\right], \qquad d_t = \left[\, \mathbf{x}_t,\ \mathbf{y}_t \,\right]
$$

The agent actions ($a_t$) define how much each design variable needs to be incremented or decremented in the iteration. For the discrete variables, we utilize a rounding approach that generalizes well across the available discrete options [32].
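The state construction and the increment-based actions described above can be sketched in Python as follows. This is an illustration under assumptions: nearest-integer rounding stands in for the rounding approach of Ref. [32], clipping to variable bounds is added for safety, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]


def flatten(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Row-major flattening of a weighting matrix."""
    return [value for row in matrix for value in row]


def build_agent_state(x, y, w_f, w_g, w_h) -> List[float]:
    """Concatenate the design state (x, y) with the current weighting matrices."""
    design_state = [float(v) for v in x] + [float(v) for v in y]
    return design_state + flatten(w_f) + flatten(w_g) + flatten(w_h)


def apply_action(x, y, a_x, a_y, x_bounds: Bounds, y_bounds):
    """Increment every variable, then round the discrete ones (assumed rule)."""
    new_x = [min(max(xi + ai, lo), hi)
             for xi, ai, (lo, hi) in zip(x, a_x, x_bounds)]
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    new_y = [min(max(int(round(yi + ai)), lo), hi)
             for yi, ai, (lo, hi) in zip(y, a_y, y_bounds)]
    return new_x, new_y


# Hypothetical design with one continuous and one discrete variable.
state = build_agent_state([0.5], [3], [[0.0, 1.0]], [[1.0]], [])
next_x, next_y = apply_action([0.5], [3], [0.25], [1.4], [(0.0, 1.0)], [(0, 5)])
```

Because the weighting matrices are appended to the state but never touched by `apply_action`, the agent can observe the operating fidelity level without being able to change it.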

While the agent cannot modify the weighting matrices, the contents of those matrices are still useful to the agent to condition its learning at different fidelity levels. For instance, if the agent happens to be in the same design state at different fidelity levels, having information about the operating fidelity level would enable it to learn to make decisions based on the reward computed using that specific fidelity level. However, as the actions depend on both the design state and the fidelity level, there would be an interaction between them.

The reward ($R_{t+1}$) measures the quality of the action ($a_t$) that transitions the design from state $d_t$ to $d_{t+1}$. This depends on the amount by which the scalar objective $f'$ reduces. Further, when the agent is in the infeasible domain, the change in the amounts by which each of the constraints in the weighted constraint matrices ($\mathbf{G}'$, $\mathbf{H}'$) are violated would guide the agent to navigate to feasible regions. Accordingly, the agent reward is composed of three reward functions ($R_f$, $R_g$, $R_h$). Here, $R_f$ is a function that rewards or penalizes the agent based on how much the objective value reduces or increases in the iteration, $R_g$ is a function that penalizes or rewards the agent based on how much more or less it violates a constraint when it is in the infeasible region, and $R_h$ is a function that penalizes or rewards the agent for how much more or less it steps away from the equality constraint hypersurface.

The quality of a design, $Q$, is evaluated from the objective and constraint terms at the fidelity levels $q_f$, $q_g$, and $q_h$, which indicate the highest fidelity levels associated with each term.
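As an illustration only, the three reward terms described above can be sketched in Python under assumptions that the text does not confirm: the total reward is taken as the unweighted sum of the three terms, an inequality constraint in negative null form is violated by max(0, g), and an equality constraint is violated by |h|. All function names are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def inequality_violations(g_prime: Matrix) -> List[float]:
    """Per-term violation of g <= 0 (zero when the term is satisfied)."""
    return [max(0.0, g) for row in g_prime for g in row]


def equality_violations(h_prime: Matrix) -> List[float]:
    """Per-term distance from the equality constraint hypersurface h = 0."""
    return [abs(h) for row in h_prime for h in row]


def step_reward(f_prev: float, f_next: float,
                g_prev: Matrix, g_next: Matrix,
                h_prev: Matrix, h_next: Matrix) -> float:
    """Assumed reward for moving from design state d_t to d_{t+1}."""
    # R_f: positive when the scalar objective decreases.
    r_f = f_prev - f_next
    # R_g: positive when inequality violations shrink, term by term.
    r_g = sum(before - after for before, after in
              zip(inequality_violations(g_prev), inequality_violations(g_next)))
    # R_h: positive when the design moves closer to h = 0, term by term.
    r_h = sum(before - after for before, after in
              zip(equality_violations(h_prev), equality_violations(h_next)))
    return r_f + r_g + r_h


# Hypothetical transition: the objective improves and both violations shrink.
example_reward = step_reward(5.0, 4.0,
                             [[0.5, -1.0]], [[0.25, -2.0]],
                             [[-0.5]], [[0.25]])
```

In this sketch a satisfied inequality constraint contributes nothing, so the constraint terms only steer the agent while it is in the infeasible region.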

### 3.3 Training Deep Neural Network Surrogates.

To demonstrate the proposed approach of RL using models of varying fidelity, we use neural networks as a tunable approach to construct surrogate models of varying accuracy and cost based on data sets generated from an underlying ground truth model. Neural networks were chosen because they can be easily tuned to achieve different levels of fidelity, for instance by varying the network architecture or the training data set. Hereinafter, the methodology to train the networks is detailed.

In this work, the weights toward various objectives and constraints at a specific fidelity level are kept constant throughout the training. The multi-fidelity multi-objective formulation therefore reduces to just a multi-fidelity formulation. This also permitted the utilization of a single neural network to predict all the objectives and constraint terms at a particular fidelity level. Specifically, we utilize two neural networks serving as models of high- and low-fidelity to predict the objective and constraint terms.

The number of neurons in the input layer is equal to the sum of the number of design variables (*m* + *n*). Further, the number of neurons in the output layer is equal to the sum of the number of objectives and constraints (*p* + *r* + *s*). The number of hidden layers, i.e., the depth of the neural network and the number of neurons per layer, i.e., the width of the neural network are the key components in the design of the neural network. They are tuned to obtain models of varying fidelity. For instance, a model with zero hidden layers and linear activation functions can serve as a model of low fidelity. On the other hand, a network with large width and depth can serve as a high-fidelity model.

The accuracy of a surrogate model is also influenced by the number of samples utilized to construct the surrogate [7]. In the context of neural networks, the size of the data set that is used for training influences the model accuracy [33–35]. In this work, both the architecture of the neural network and the size of the dataset used for training were varied to prepare models of varying fidelity. Specifically, deeper networks and larger data sets were used for higher levels of fidelity. The depth of the networks and the size of the datasets used for training were found by iterative tuning to prepare disparate models. This process was supported by AutoKeras [36] in some cases, as it offers the flexibility of training models of varying fidelity by explicitly specifying the maximum number of parameters allowed while searching for a neural network architecture that performs well on a dataset.

### 3.4 Training and Evaluating Reinforcement Learning Agents.

To demonstrate the tradeoff between computational efficiency and solution quality using the proposed RL framework, a parametric study is conducted with varying training schedules as shown in Fig. 2. As the formulation is reduced from multi-objective multi-fidelity to bi-fidelity, we utilize a scalar binary parameter, *w*′, that determines the operating fidelity level (*w*′ = 0 for low-fidelity, *w*′ = 1 for high-fidelity). The parameter, *n* in Fig. 2 determines the iteration at which the fidelity level switches from the low-fidelity neural network to the high-fidelity one. This essentially governs the proportion of usage of the low- and high-fidelity models.
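The switching schedule can be sketched in Python as below. The sketch treats $n$ as an iteration index, which is an assumption, since the units of $n$ (iteration count or percentage of the training budget) are not specified here; the function names are hypothetical.

```python
def operating_fidelity(iteration: int, n_switch: int) -> int:
    """Return w' for an iteration: 0 = low-fidelity, 1 = high-fidelity."""
    return 0 if iteration < n_switch else 1


def low_fidelity_fraction(n_switch: int, total_iterations: int) -> float:
    """Proportion of iterations evaluated with the low-fidelity model."""
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    used = min(max(n_switch, 0), total_iterations)
    return used / total_iterations


# Hypothetical schedule: switch to the high-fidelity model at iteration 2.
schedule = [operating_fidelity(t, 2) for t in range(4)]
```

Under this reading, a switch at iteration 0 uses the high-fidelity model throughout, while a switch at the final iteration uses the low-fidelity model throughout.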

A proximal policy optimization algorithm [37] is used for training the policies with several values of *n*. Several randomly sampled designs that may or may not satisfy the constraints are used as seed designs for training each policy. The number of iterations per episode, the total number of episodes, and other RL hyperparameters are tuned to yield designs that satisfice for a policy that utilizes just the high-fidelity network for each problem and are kept constant throughout the parametric study.

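While the solution quality, $Q$, of each trained policy is evaluated using the high-fidelity model, it is also evaluated using the low-fidelity model to understand the exploration behavior of the agent, as discussed in Sec. 5. Accordingly, two metrics are defined: the solution quality evaluated using the high-fidelity model and the solution quality evaluated using the low-fidelity model.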

To understand the behavior of exploration using the proposed framework, a two-dimensional embedding is trained for visualizing several trajectories in the design space [38]. Specifically, Principal Component Analysis (PCA) is performed using all the design state vectors that were visited while evaluating all the trained policies.
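A two-dimensional PCA embedding of design state vectors can be sketched in dependency-free Python as follows. Power iteration with deflation is used here as a stand-in for a library PCA routine, and the function names and sample data are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract the per-variable mean from every design state vector."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix of centered data (rows are samples)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def leading_component(matrix: Matrix, iterations: int = 500) -> Tuple[float, Vector]:
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / (k + 1) for k in range(d)]  # deterministic, non-uniform start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break  # no variance left along any remaining direction
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca_2d(data: Matrix) -> Matrix:
    """Project every row of data onto its first two principal components."""
    centered = center(data)
    cov = covariance(centered)
    lam1, pc1 = leading_component(cov)
    d = len(cov)
    # Deflation removes the first component so the second can be found.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = leading_component(deflated)
    return [[sum(a * b for a, b in zip(row, pc1)),
             sum(a * b for a, b in zip(row, pc2))] for row in centered]


# Hypothetical design states with most variance along the first variable.
embedding = pca_2d([[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
```

Fitting the projection once on the states visited by all policies, as the text describes, places every trajectory in the same two-dimensional coordinates so that they can be compared directly.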

To understand the tradeoff between computational efficiency and solution quality, the data from all cases of the study are processed. As the total number of iterations is constant across the study, the total time required to evaluate the objectives and constraints using the low- and high-fidelity models is utilized to reflect the computational efficiency. The solution quality values that are evaluated using the high-fidelity model, as per Eq. 19, are utilized for eliciting the tradeoff trend. Specifically, an exponential curve is fitted to the data from all cases using a least squares method.
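One simple way to fit an exponential trend by least squares is sketched below in Python. The functional form y = a·exp(b·x) and the log-linearized fit are assumptions: the exact curve and solver used in the study are not stated here, and this variant minimizes the squared error of log(y), which requires positive y values. The data are synthetic and the names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * exp(b * x) by ordinary least squares on log(y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    if any(y <= 0.0 for y in ys):
        raise ValueError("log-linearized fit requires positive y values")
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = s_xy / s_xx                      # slope of the line in log space
    a = math.exp(mean_log - b * mean_x)  # intercept mapped back from log space
    return a, b


# Synthetic data generated from a known curve (a = 2.0, b = 0.5).
example_times = [1.0, 2.0, 3.0, 4.0]
example_quality = [2.0 * math.exp(0.5 * t) for t in example_times]
a_fit, b_fit = fit_exponential(example_times, example_quality)
```

A nonlinear least squares solver applied to the untransformed data would weight the residuals differently and may be preferable when the quality values are noisy or not strictly positive.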

## 4 Case Studies

The motivation for the case studies is to demonstrate the effectiveness of the proposed reinforcement learning framework for design space exploration and the tradeoff between computational efficiency and solution quality. We consider two multimodal multi-objective constrained mixed integer nonlinear skeletal design problems involving the physical components of a ground vehicle and an aerial vehicle. The details of the ground and aerial vehicle problems are described in Secs. 4.1 and 4.2, respectively.

### 4.1 Ground Vehicle Problem.

The ground vehicle skeletal design problem is based on prior work on Formula Society of Automotive Engineers (SAE) vehicles [39,40]. It involves multiple subsystems of the vehicle such as suspension, wings, etc. Figure 3 shows a schematic of the vehicle along with the number of design variables associated with different subsystems. The design space of the problem comprises 29 continuous (e.g., cabin length, wing length) and 10 discrete (e.g., engine choice from a set of 21 engines) variables. To reduce the dimensionality of the problem, the discrete variables are transformed from nominal to ordinal based on their size and key performance indicators. By considering all possible discrete values and merely 10 values for the continuous variables, the size of the combinatorial space is of the order of 10^{39}, a value comparable to the state space size of 10^{40} in chess [41].

The design objective is defined by a set of 11 sub-objectives to judge the overall performance of the system. These are the mass of the vehicle, center of gravity height, drag, downforce, acceleration, crash force, impact attenuator volume, cornering velocity, braking distance, suspension acceleration, and pitch moment. Further, the design is subject to 80 inequality constraints, comprising practical constraints (which are the choice of the modeler, e.g., rear wing length should have a value of at least 0.05 m) and natural constraints (which represent physical necessity, e.g., a positive ground normal reaction). These include 78 linear inequality constraints and two nonlinear inequality constraints. The reader is referred to prior work [40] for the detailed analytical expressions associated with the objectives and constraints.

For the low-fidelity network, the input layer is of size $N_i$, and $O_{N_o,k}$ represents the output layer of size $N_o$ with activation $k$. Further, the smaller dataset is utilized for training this network.

For the high-fidelity network, the input layer is of size $N_i$, $D_{j,k}$ represents a dense (hidden) layer of size $j$ with an activation denoted by $k$, $S$ represents the sigmoid activation, $R$ represents the ReLU activation, $L$ represents linear activation, and $O_{N_o,k}$ represents the output layer of size $N_o$ with activation $k$.

The neural networks are trained using the mean squared error loss function and the adaptive moment estimation optimizer. The prediction accuracies of the objectives and constraints are measured using the coefficient of determination of the individual components as well as the overall coefficient of determination of the predictions. The high-fidelity model accuracies range from 0.977 to 0.999 for the individual components. The low-fidelity model accuracies range from −0.661 to 0.493 for the individual components. The higher accuracies of the high-fidelity network are due to the larger size of the dataset and a larger neural network as compared to the low-fidelity network. While some values of the coefficients of determination of the individual components are negative (indicating a poor fit), the remaining components could still drive the exploration behavior toward high-quality regions on average. The medians of the computational expense of prediction using the neural networks and the accuracy as measured by the overall coefficient of determination of the predictions are illustrated in Fig. 4.
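To make the negative accuracy values concrete, the coefficient of determination for a single output component can be computed as in the Python sketch below. The function name and the sample values are hypothetical; how the overall coefficient is aggregated across components is not specified here and is not attempted.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


targets = [1.0, 2.0, 3.0]
perfect = r_squared(targets, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared(targets, [2.0, 2.0, 2.0])  # always predicts the mean
reversed_fit = r_squared(targets, [3.0, 2.0, 1.0])  # worse than the mean
```

A value of zero corresponds to always predicting the mean of the data, so a negative value means that the surrogate fits that component worse than a constant prediction would.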

The objective weights associated with different sub-objectives are adapted from Soria Zurita et al. [40] and are kept constant throughout all training schedules. Further, all the constraint weights are kept equal to 1. The objective and constraint models of the neural networks along with these weights compose the reward function. The state of the agent is defined by the design variables and the scalar binary parameter, *w*′ that determines the operating fidelity level. To ensure that the agent makes stable progress across the iterations of a learning episode, the magnitude of allowable design parameter changes is capped at a value of one-tenth of the range of the continuous variables and a value of 2 for the ordinal variables.
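The cap on design parameter changes can be sketched in Python as below. Clipping the proposed increments is an assumption, as the cap could equally be enforced by bounding the action space itself; the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def cap(delta: float, limit: float) -> float:
    """Clip an increment to the interval [-limit, limit]."""
    return max(-limit, min(limit, delta))


def cap_design_action(a_continuous: Sequence[float],
                      ranges: Sequence[Tuple[float, float]],
                      a_ordinal: Sequence[float],
                      ordinal_cap: float = 2) -> Tuple[List[float], List[float]]:
    """Cap continuous steps at one-tenth of each range and ordinal steps at 2."""
    capped_continuous = [cap(a, (hi - lo) / 10.0)
                         for a, (lo, hi) in zip(a_continuous, ranges)]
    capped_ordinal = [cap(a, ordinal_cap) for a in a_ordinal]
    return capped_continuous, capped_ordinal


# Hypothetical action on two continuous variables (each with range 0 to 10)
# and two ordinal variables.
capped = cap_design_action([3.0, -0.25], [(0.0, 10.0), (0.0, 10.0)], [5, -1])
```

Small steps of this kind keep consecutive design states close to each other, which is consistent with the stated goal of stable progress within a learning episode.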

### 4.2 Aerial Vehicle Problem.

The aerial vehicle skeletal design problem involves a quadcopter that is designed using a corpus of components and a high-fidelity flight dynamics model [42]. The components include batteries, electronic speed controls (ESCs), motors, and propellers. The design space of the problem comprises two continuous variables—arm length and support length—and four ordinal variables for the choice of batteries, ESCs, motors, and propellers from an ordered set of the components like the previous case study. The reader is referred to prior work [42] for details on the corpus of components used in this problem. Figure 5 illustrates the skeletal design artifact of a quadcopter generated by assigning random values to the design variables. By considering all possible discrete values and merely 10 values for the continuous variables, the size of the combinatorial space is of the order of 10^{8}. While the number of design variables is lower than in the previous case study, this problem involves a larger number of choices for the ordinal variables.

The design objective is defined by a vector of five sub-objectives to judge the overall performance of the system. These include the maximum hover time, maximum attainable speed, range covered at this maximum attainable speed, maximum coverable range, and speed maintained to cover this range. To emphasize, these objectives aim at developing fast, long-range quadcopters. Further, the design is subject to 27 inequality constraints associated with physical interferences in design and the operating limits of the quadcopter components. These include fixed bounds on the six design variables and 15 nonlinear inequality constraints. The reader is referred to prior work [42] on the flight dynamics model for further details on these objectives and constraints. Despite having a smaller combinatorial state space than the previous study, this one has a higher number of nonlinear constraints.

The neural networks are trained using the mean squared error loss function and the adaptive moment estimation optimizer, as in the previous problem. The prediction accuracies of the objectives and constraints for the high-fidelity network are lower than in the previous problem because of the smaller dataset. The accuracies of the individual components of the high-fidelity predictions range from 0.395 to 0.907. The accuracies of the individual components of the low-fidelity predictions range from −2.531 to 0.629. The higher accuracies of the high-fidelity network are due to the larger neural network as compared to the low-fidelity network. As in the previous problem, while some values of the coefficient of determination of the individual components are negative (indicating a poor fit), the remaining components could still drive the exploration behavior toward high-quality regions. The medians of the computational expense of prediction using the neural networks and the accuracy as measured by the overall coefficient of determination of the predictions are illustrated in Fig. 6.

The objective weights are kept equal due to the lack of expert knowledge of specific design requirements. Further, all the constraint weights are equated to 1. The agent state, actions, reward formulation, number of iterations per episode, and the RL hyperparameters are the same as in the previous case study.

## 5 Results and Discussion

### 5.1 Ground Vehicle Problem

The RL policies were trained and evaluated for the cases *n* = {0, 10, 20, …, 80, 90, 100} for the ground vehicle problem. Specifically, 6000 seed designs were randomly sampled and utilized for training and evaluating all the policies. The results of the evaluation for four cases (high-fidelity alone, mixed with high-fidelity dominant, mixed with low-fidelity dominant, and low-fidelity alone) are shown in Fig. 7 and discussed in further detail. These cases correspond to the parameter values *n* = {0, 30, 70, 100}, respectively. The quality of the solutions for all the cases is better than that of the seed designs, as indicated by the upward trend in all plots. This showcases the ability of the proposed variable-fidelity framework to effectively search for solutions. Further, the solution qualities are higher for the cases *n* = {0, 30} than for the cases *n* = {70, 100}. Aside from the final quality values, the nature of the quality-iteration plots differs across the cases. For the case *n* = 0 (Fig. 7(a)), the quality increases with a small rise in dispersion in the initial iterations. For the case *n* = 30 (Fig. 7(b)), the dispersion in the initial iterations is higher than in the case *n* = 0. However, it eventually converges into a high-quality region. For the case *n* = 70 (Fig. 7(c)), the quality rises slowly while the low-fidelity model is operational. Further, a steep rise in quality is observed when the agent switches to the high-fidelity model. However, it does not converge to a high-quality region because of the limited number of iterations left after the switch. Lastly, for the case *n* = 100 (Fig. 7(d)), the nature of the plot is similar to the case *n* = 0 (Fig. 7(a)). However, it converges to a region of lower quality. To emphasize further, the difference in the plots is observed even when the low-fidelity model has been operational for the same number of iterations in different cases. This indicates that the agent is exploring different regions of the design space even when using the low-fidelity model for the same number of iterations before the switch.
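The role of the parameter *n* can be illustrated with a minimal sketch, assuming that *n* is the percentage of iterations at the start of an episode that use the low-fidelity model before a single switch to the high-fidelity model. The episode length, the per-evaluation costs, and the function names below are hypothetical placeholders, not values or identifiers from this work.

```python
def fidelity_schedule(n_percent, iterations_per_episode):
    """Return the model label used at each iteration of one episode.

    Assumption: the first n_percent of the iterations use the low-fidelity
    model and all remaining iterations use the high-fidelity model.
    """
    n_low = (iterations_per_episode * n_percent) // 100
    return ["low"] * n_low + ["high"] * (iterations_per_episode - n_low)


def episode_cost(schedule, t_low, t_high):
    """Total evaluation time of one episode for given per-call costs."""
    return sum(t_low if model == "low" else t_high for model in schedule)


ITERATIONS = 10            # hypothetical episode length
T_LOW, T_HIGH = 1.0, 5.0   # hypothetical per-evaluation costs (arbitrary units)

costs = {
    n: episode_cost(fidelity_schedule(n, ITERATIONS), T_LOW, T_HIGH)
    for n in (0, 30, 70, 100)
}
```

Under these placeholder costs, the evaluation time per episode falls linearly as *n* grows, while the number of iterations left after the switch shrinks, which is the mechanism behind the limited recovery seen for *n* = 70.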

To understand the exploration behavior of the RL agents, a two-dimensional embedding is prepared by performing PCA on all the design vectors that were visited while evaluating all the trained policies. The embedding is visualized in Fig. 8. In these sub-figures, the scatter plot shows all the regions of the design space that were visited during policy evaluation across all cases. The color reflects the quality of the design as measured by the high-fidelity and low-fidelity model in Figs. 8(a) and 8(b), respectively. Further, we plot one seed and the evaluated trajectories for each of the four cases discussed earlier. Additionally, for the mixed-fidelity cases, we highlight the agent step at which the fidelity level changes.
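A two-dimensional PCA embedding of this kind could be computed along the following lines. This is a sketch only: it assumes the design variables have already been scaled to comparable ranges, uses random placeholder data in place of the visited designs, and relies on NumPy's singular value decomposition rather than any particular tool used by the authors.

```python
import numpy as np


def pca_embed(designs, n_components=2):
    """Project design vectors onto their leading principal components."""
    X = np.asarray(designs, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


# Placeholder data: 200 visited designs with 6 variables each (hypothetical sizes).
rng = np.random.default_rng(0)
visited = rng.random((200, 6))
embedding = pca_embed(visited)
```

Fitting the projection once on the union of designs visited by all policies, as described above, places every trajectory in the same coordinate system, so paths from different cases can be compared directly.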

First, the colormaps indicate that the low-fidelity model deviates significantly from the high-fidelity model in several regions of the design space. For instance, in the case *n* = 0, the trajectory converges to a high-quality region as per the high-fidelity model (Fig. 8(a)). However, this region has poor quality as per the low-fidelity model (Fig. 8(b)). For the case *n* = 100, the trajectory converges to another region that has the highest quality as per the low-fidelity model. However, this region has a moderate quality as measured by the high-fidelity model. This explains the difference in the qualities that were observed in Figs. 7(a) and 7(d), respectively. For the mixed-fidelity cases, the trajectories lie between the high-fidelity and low-fidelity trajectories, depending on which model is dominant. Moreover, the trajectories are significantly different for these cases even before the agent switches from the low-fidelity model to the high-fidelity one. We attribute this to the evaluative feedback received from future states (including the states when the high-fidelity model is operational) and the interaction in the learning at both fidelity levels in a specific design state. In the case *n* = 30, the location of the trajectory is shifted due to the influence of the low-fidelity model. Accordingly, it passes through regions of lower quality than the case *n* = 0 when measured by the high-fidelity model. However, it eventually converges into a high-quality region. This is in accordance with the higher dispersion followed by convergence to a high quality that was observed in Fig. 7(b). For the case *n* = 70, the trajectory shifts further toward the trajectory of the case *n* = 100. In this region, the quality (as measured by the high-fidelity model) along the trajectory rises slowly while the low-fidelity model is operational. Further, the direction of the trajectory changes drastically when the model switches. Specifically, it starts moving toward a high-quality region as measured by the high-fidelity model. However, it makes limited progress because only a few iterations remain once this model is operational. Again, this is consistent with the quality-iteration plot in Fig. 7(c).

To understand the tradeoff between computational efficiency and solution quality, the data from all the cases (*n* = {0, 10, 20, …, 80, 90, 100}) were processed. Specifically, for each case, the total time for evaluating the objectives and constraints and the solution quality as measured by the high-fidelity metric were computed and are shown in Fig. 9. For the cases when the high-fidelity model is dominant (i.e., *n* = {0, …, 50} on the right side of the plot), a high quality of solutions is maintained even with a significant reduction in compute time. For the cases *n* = {60, 70, 80}, the quality values show high dispersion and are lower than or comparable to those of other cases that require less computation time. This is attributed to the fact that the agent has few iterations remaining to be able to converge to a different region after switching to the high-fidelity model. This behavior is detailed for the case *n* = 70 in Figs. 7(c) and 8. For the cases *n* = {90, 100}, we achieve a moderate quality of solutions based on the low-fidelity model. Lastly, the quality of the seed designs is plotted at *t* = 0 as the sampling of seed designs does not involve the computation of objectives and constraints. To elicit a tradeoff trend from the data, an exponential curve is fitted by least squares to the solutions obtained from all seed designs across all the cases. Specifically, we use the form *Q*_{h} = *ae*^{kt} + *b*, where *b* is the asymptotic value achieved by *Q*_{h} when the high-fidelity model is dominant, *a* + *b* is the *Q*_{h}-intercept that reflects the quality of the seed designs, and *k* is a parameter that reflects the rate at which the quality changes. The goodness of this fit is measured using the coefficient of determination and is noted in Fig. 9. A high value of the coefficient of determination for the best fit curve shows the suitability of the chosen functional form and a good resultant fit for the data. We observe that the solution quality increases with computation time across the cases. Further, it should be noted that this curve of solution quality versus compute time is concave and approaches an asymptotic value of −0.173. This indicates that good solutions can be achieved with substantial reductions in compute time.
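A least-squares fit of the form *Q*_{h} = *ae*^{kt} + *b* can be sketched as follows. The solver shown here is an assumption and not necessarily the one used in this work: for each candidate rate *k* on a grid, the model is linear in *a* and *b*, so these two are solved in closed form and the *k* with the smallest residual is kept. The data are synthetic and generated from hypothetical parameters, not from the results reported above.

```python
import numpy as np


def fit_exponential(t, q, k_grid):
    """Least-squares fit of q ~ a * exp(k * t) + b.

    For a fixed rate k the model is linear in (a, b), so those two are
    solved with linear least squares; the best k on the grid is returned.
    """
    t = np.asarray(t, dtype=float)
    q = np.asarray(q, dtype=float)
    best = None
    for k in k_grid:
        A = np.column_stack([np.exp(k * t), np.ones_like(t)])
        coeffs, *_ = np.linalg.lstsq(A, q, rcond=None)
        sse = float(np.sum((A @ coeffs - q) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coeffs[0]), float(k), float(coeffs[1]))
    sse, a, k, b = best
    r2 = 1.0 - sse / float(np.sum((q - q.mean()) ** 2))
    return a, k, b, r2


# Synthetic, noise-free data from hypothetical parameters (not the paper's values).
t = np.linspace(0.0, 10.0, 50)
q = -0.8 * np.exp(-0.5 * t) - 0.2
a, k, b, r2 = fit_exponential(t, q, k_grid=np.linspace(-2.0, -0.05, 40))
```

With *a* < 0 and *k* < 0, as in this synthetic example, the fitted curve rises from the intercept *a* + *b* at *t* = 0 and flattens toward the asymptote *b*, which is the increasing, concave shape described above.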

### 5.2 Aerial Vehicle Problem

The RL policies were trained and evaluated for the aerial vehicle problem, for the same cases and the same number of seed designs as in the previous problem. The results of evaluating the policies for the cases *n* = {0, 30, 70, 100} are shown in Fig. 10. The quality of the solutions for all the cases is better than that of the seed designs, again showing that the framework effectively searches for solutions in all cases. Further, the solution qualities drop as the low-fidelity model usage increases, similar to the previous problem. For the cases *n* = 0 and *n* = 30 (Figs. 10(a) and 10(b)), the quality increases in a similar manner to yield high-performance solutions. For the case *n* = 70 (Fig. 10(c)), the rate at which quality improves increases slightly when the agent switches models. Lastly, for the case *n* = 100 (Fig. 10(d)), the agent converges to a region of lower quality than in the other cases. Unlike the previous problem, the nature of the quality-iteration plots is similar across the cases while the low-fidelity model is operational. This indicates that the agent may be exploring a similar region of the design space across the cases while the low-fidelity model is operational before the switch.

To understand the exploration behavior of the RL agents, a two-dimensional embedding is prepared as in the previous problem and is shown in Fig. 11. First, the colormaps indicate that the low-fidelity model deviates from the high-fidelity model mainly in the latter portions of the trajectory. The four trajectories in Fig. 11 follow a similar path until the last few iterations. For the cases *n* = 0 and *n* = 30, the agent solutions converge to a similar region, as indicated in Figs. 10(a) and 10(b). For the case *n* = 100, the trajectory changes direction toward a high-quality region as measured by the low-fidelity model. However, this region has lower quality when measured by the high-fidelity model. This explains the lower quality that is observed in Fig. 10(d). For the case *n* = 70, the trajectory follows the same direction as the case *n* = 100 while the low-fidelity model is operational. After the model switches, it changes its direction toward the high-quality region as measured by the high-fidelity model. While this improves the quality, the improvement is limited by the number of remaining iterations. These search patterns explain the quality-iteration plots of Figs. 10(c) and 10(d).

To understand the tradeoff between computational efficiency and solution quality, the data from all the cases (*n* = {0, 10, 20, …, 80, 90, 100}) were processed. Specifically, for each case, the total time for evaluating the objectives and constraints and the solution quality as measured by the high-fidelity metric were computed and are shown in Fig. 12. For the cases when the high-fidelity model is dominant (i.e., *n* = {0, …, 50} on the right side of the plot), a high quality of solutions is maintained even with a significant reduction in compute time. For the remaining cases *n* = {60, …, 100}, we observe a steady decrease in quality with a decrease in computation time. Lastly, the quality of the seed designs is plotted at *t* = 0 as the sampling of seed designs does not involve the computation of objectives and constraints. To elicit a tradeoff trend from the data, an exponential curve is fitted similarly to the previous problem and is shown in Fig. 12. The coefficient of determination of 0.729 supports the suitability of the chosen functional form. We observe that the solution quality increases with computation time across the cases. Further, it should be noted that this curve of solution quality versus compute time is concave and approaches an asymptotic value of −0.205. This indicates that fairly good solutions can be achieved with substantial reductions in compute time.

These parametric case studies showcase that the proposed framework can not only balance the tradeoff between computational efficiency and solution quality but also find high-performance solutions to high-dimensional design problems where the use of just a high-fidelity model is infeasible.

## 6 Conclusion

This paper proposes a reinforcement learning framework based on models of varying fidelity that addresses the computational expense of the high-fidelity simulations often necessary to evaluate objective and constraint functions in design space exploration. It uses neural network models of varying fidelity and gives the RL agent flexibility to incorporate predefined constant or variable schedules for exploration using these models. We showcase the potential of the framework in two case studies that involve the design of the physical components of a ground and aerial vehicle. The RL agent converges to high-performance regions of the design space using objective evaluations at two fidelity levels. A parametric study with different training schedules for exploration at these fidelity levels demonstrates the tradeoff between computational efficiency and solution quality. Further, a concave tradeoff trend showcases the potential of the framework to find high-performance solutions to design problems where the use of just a high-fidelity model is infeasible. Lastly, the exploration behavior of the agents is discussed by visualizing an embedding space.

Future work should explore the application of this framework to design problems beyond skeletal design. These include configuration design problems based on graph grammar representations. While RL has been used in conjunction with shape grammars for generative design [43], design space exploration with such a representation has not yet been studied using variable-fidelity models. Alternatively, this could involve learning embeddings for representing the design space [44,45] and extending the existing framework for exploring this embedding space. The weighting schedules of the framework can also potentially be designed or made adaptive based on multi-fidelity model management strategies [7,9,10,13], knowledge of expert designer behavior [46], or RL-based approaches [47]. Varying attributes like the episode length and the number of episodes could also reveal different patterns across low-fidelity model usage. Lastly, the search space can be modified to bias the search toward non-intuitive solutions by incorporating curiosity [48] into the agent reward.

The two case studies are also limited to the physical domain of the engineered system. It would be interesting to evaluate the potential of the proposed framework for the co-design of the cyber and physical domains of such systems. For instance, in prior work that addressed an intelligent manufacturing shop floor [49], the design space of the physical components of robots and their control policies can be explored together to yield high-performing shop floors. A semi-automated human-in-the-loop strategy, involving a designer or domain expert who steers the exploration tool after a few iterations with the help of visualization, could also be an extension to the proposed framework. With humans creating associations across design domains, and machines recognizing statistical patterns from data, such a framework could lead to a symbiotic exploration paradigm.

## Acknowledgment

The authors are grateful to the Southwest Research Institute for providing simulation capabilities for the drone case study used in this work. We are also grateful to Susmit Jha and Adam Cobb of SRI International for their feedback on early versions of this work.

## Funding Data

This material is based upon work supported by the Defense Advanced Research Projects Agency through cooperative agreement FA8750-20-C-0002. Any opinions, findings, and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the sponsors.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.