Research Article

Empirical study of privacy inference attack against deep reinforcement learning models

Article: 2211240 | Received 24 Feb 2023, Accepted 03 May 2023, Published online: 11 Jul 2023

Abstract

Most studies on privacy in machine learning have focused on supervised learning, with little research on privacy concerns in reinforcement learning. Our study demonstrates that observation information can be extracted through trajectory analysis. In this paper, we propose a variable information inference attack targeting the observation space of policy models, categorised into two types: observed value inference and observed variable inference. Our algorithm achieves a high success rate in privacy inference attacks for both types of observation information.

1. Introduction

Machine learning research has experienced rapid growth, leading to significant advances in image recognition (He et al., Citation2016), natural language processing (Vaswani et al., Citation2017) and robotic control (Lillicrap et al., Citation2015). However, the application of these technologies carries a significant risk of data privacy breaches. When deep learning models are deployed, they provide a new means of accessing information from the training dataset, which can reveal private information that attackers may exploit. The deployment of deep learning models therefore poses a security risk that must be addressed. For example, membership inference attacks (Shokri et al., Citation2017) allow querying a model to determine whether a sample is in the training set, and Ganju et al. (Citation2018) present an inference attack to extract dataset attributes. In some cases, attempts are even made to reconstruct the entire training set, resulting in severe privacy leaks in publicly used models. The issue of privacy arises in both supervised and reinforcement learning (RL). The study by Pan et al. (Citation2019) emphasises that policy models in RL are at risk of leaking information about the environment. RL models in various domains, such as healthcare (Esteva et al., Citation2019), contain sensitive data that may be exploited, leading to privacy disclosure. Therefore, further research into the privacy concerns of reinforcement learning is necessary.


This study explores a scenario in which multiple reinforcement learning models work on a single decision task but use different datasets. The situation is analogous to the well-known fable (Goldstein, Citation2010) of the six blind men trying to describe an elephant by touching different parts of its body: although each blind man formed a distinct impression of the elephant, it was possible to infer which part he was touching from his description. In the same vein, inference attacks can extract information about an agent's observation by analysing its decision-making performance on a task. We examine two scenarios. In the first, attackers know which state variables were used in model training but not their specific values; their objective is to infer these values. In the second, attackers know the complete set of state variable types for the task, but not which of them appear in the observation data used to train the model. Two commonly used experimental environments illustrate the difference. Consider first the pole length of CartPole in the Gym environment: this parameter is an environmental variable that can take different values, and this variability leads to different model behaviours. In MuJoCo, by contrast, a robot makes decisions based on information from its various joints; if robots perceive different subsets of joint information, these subsets constitute different observation variables. In real-world applications, such as finance, institutions with different data ownership may use distinct observation data to train models for trading stocks in the same environment. Even though these institutions operate in the same environment, their decision characteristics may differ. An attacker can therefore steal observation information from a model, use it to identify the target model's specific decision-making patterns, and launch targeted attacks aimed at degrading the model's performance or influencing its decisions in the attacker's favour.

Inferring observation information is a complex problem: the space of possible observations is vast, there are many combinations of observed variables, and the relationship between observations and the model's decision features is largely unknown. Directly inferring a model's observation knowledge from data is therefore highly challenging unless the range of inference is restricted, a constraint that is a fundamental assumption in current approaches. Previous research (Pan et al., Citation2019) has used the mean and variance of rewards to infer observation information; our approach goes further by utilising trajectory data, which contains more detailed information about the training environment. We develop a robust attack framework that exploits trajectory data for privacy theft and validate it experimentally across standard reinforcement learning benchmarks. Our results demonstrate successful inference of variable values in the Gym environments and of variable types in the MuJoCo environments.

In summary, our contributions can be stated as follows:

  • We performed an empirical analysis demonstrating the privacy concerns related to observational data, focusing on two types of information: variable values and observed variables.

  • We proposed a novel approach that leverages trajectory data for inferring observation information and established a practical attack framework for privacy theft.

  • We experimentally validated the effectiveness of our attack framework in various benchmark environments, achieving a high success rate.

2. Related work

2.1. Machine learning privacy leakage

Previous studies have extensively investigated privacy theft from machine learning models via their training data. Most of these studies have focused on supervised learning models, covering attacks such as membership inference (Carlini et al., Citation2019; Shokri et al., Citation2017), attribute inference (Gong & Liu, Citation2016), and dataset reconstruction (Salem et al., Citation2020). Among these, membership inference has become the most popular type, where the goal is to determine whether a given data sample was part of the training set used to build the model. White-box attacks, which leverage model parameters and gradients, have been shown to be more accurate. Although this type of attack may not appear to pose a significant privacy risk in everyday scenarios, it can be a major concern in areas such as health analytics, where differences between cases and controls could reveal sensitive conditions. Membership inference has been widely studied in the context of machine learning, with several recent works (Ai et al., Citation2021; Hou et al., Citation2022; Li et al., Citation2022; Liang et al., Citation2022; G. Lin et al., Citation2022; Yan, Hu et al., Citation2021; Yan, Jiang et al., Citation2021) focusing on related issues.

Reconstruction attacks aim to reconstruct one or more training samples or their corresponding labels, either partially or completely. These attacks utilise output labels and partial knowledge of specific features to recover sensitive features or complete data samples. Attribute inference (Gong & Liu, Citation2016), on the other hand, targets dataset attributes that are irrelevant to the learning task, such as the gender distribution in a patient dataset. The leakage of such properties can compromise privacy and provide additional information about the training data. While the distribution of certain attributes may differ between datasets, in reinforcement learning the attributes of the dataset depend on the environment, making it difficult to determine whether a given sample point was part of the sampling process. Our work on attribute inference attacks against reinforcement learning models demonstrates the potential for machine learning models to reveal private information.

2.2. Privacy risk in reinforcement learning

Reinforcement learning algorithms have been shown to be vulnerable to exploitation by attackers, as highlighted by Huang et al. (Citation2017). In particular, neural network policies used in reinforcement learning can be easily fooled by adversarial examples, where slight modifications to the input cause the model to produce incorrect results. Since then, there has been a surge of research on adversarial attacks in reinforcement learning, including Y. C. Lin et al. (Citation2017), who proposed a method to generate unique adversarial examples at each time step in Atari games, and Mo et al. (Citation2022), who developed methods for launching adversarial attacks on reinforcement learning algorithms. This growing interest is driven by the increasing practical applications of reinforcement learning in modern AI and its critical role in AI security.

Reinforcement learning is not immune to privacy attacks, and several studies have highlighted this issue. For instance, Masters and Sardina (Citation2019) treat an agent's goal as a privacy concern, while Wang and Hegde (Citation2019) and Liu et al. (Citation2021) argue that the reward function can reveal sensitive information about the agent's objectives. Another privacy attack, proposed by Pan et al. (Citation2019), aims to steal transition dynamics information. Additionally, Henderson et al. (Citation2018) argue that reinforcement learning is data-sensitive and can therefore be susceptible to privacy leakage. These studies underscore the need for robust privacy protection in reinforcement learning, given its increasing use in various domains.

3. Background

3.1. Markov decision process

A Markov Decision Process (MDP), defined as a 5-tuple $(S, A, T, R, \gamma)$, is a sequential decision process for a fully observable environment. $S$ is the state space and $A$ is the action space. The transition dynamics $T$ is a probability mapping from state-action pairs to states. The reward function $R$ gives the reward of an action for the task, and the discount factor $\gamma$ weights future rewards in the accumulated return. A partially observable Markov decision process (POMDP) is a 6-tuple $(S, A, T, O, R, \gamma)$ in which the agent cannot observe the underlying state; the underlying state transition $T$ is the same as in an MDP, but the POMDP has an additional observation space $O$. A POMDP model receives an observation $o_{t+1} \in O$ when reaching the next state $s_{t+1}$. The goal of an agent in either an MDP or a POMDP is to maximise its expected future discounted return, where $r_t$ is the immediate reward received at time $t$ and $\gamma \in [0, 1]$.
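For concreteness, the expected discounted return referred to above can be written out explicitly; the following is the standard textbook formulation, stated here as a reminder rather than reproduced from the article:

```latex
% Expected future discounted return maximised by the agent,
% where r_t is the immediate reward at time t and \gamma \in [0,1]
% is the discount factor.
G_t = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k} \right]
```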

3.2. Reinforcement learning

Reinforcement learning is a methodological paradigm for solving sequential decision problems such as MDPs and POMDPs. It learns the essential transition and reward structure of a task through interaction and iteratively obtains the best action policy for the current environment. At step $t$, the agent observes the current state $s_t$ and acts according to a policy to interact with the environment. By continuously receiving new states and rewards, the agent learns an optimal policy that maximises the expected future return. For an MDP, the policy $\pi(a_t \mid s_t)$ is conditioned on the full state. For a POMDP, since the state $s$ is not observable, the observation $o$ is used in the learned policy.

4. Methodology

In this section, we will first introduce the privacy inference problem and provide a definition of it. Then, we will focus on a specific attack method that targets two types of information.

4.1. Problem definition

The scenario we consider involves multiple agents working on a decision task, each observing different parts of the environment; these correspond to distinct partially observable Markov decision processes that share the same underlying Markov decision process. The agents are trained as different decision models with different observations, resulting in distinct decision characteristics, and an attacker aims to infer the observation information of these models from those characteristics. The difference in observation spaces is captured by the difference in state variables. A state $s$ is defined by a set of state variables $s = \{u_1, u_2, \ldots, u_m\}$, with a corresponding state space $S$ and state variable set $U = \{u_1, u_2, \ldots, u_m\}$. An observation space $o$ is a state set $o = \{s_1, s_2, s_3, \ldots, s_m\}$, where the state variables and their values form the observation space. A state space $S$ has an associated set of observation spaces $O = \{o_1, o_2, o_3, \ldots, o_m\}$. This study aims to infer the information of the observation space $o$ of the target model, identifying two different types of partial observation; for each, we propose a dedicated inference attack.

The first type of partial observation involves an observation space where all state variables are included, with $U_O = U_S$. However, at least one variable, $u_i$, has a different value in the observation space. In this scenario, the attacker has knowledge of the observation variables but lacks information about the value of some of them. The goal of the attacker is to infer the value of the target model's observation variables. This attack is referred to as observed value inference since it seeks to infer the value of the environment variable.

The second type of partial observation occurs when the set of observation variables, denoted as $U_O$, is incomplete, excluding at least one state variable from the set $U_S$. This creates different observation spaces, each varying by at least one observation variable. In this scenario, the attacker is assumed to have knowledge of the complete state and all environmental variables, but lacks information about the specific variables observed by the target model. The goal of the attacker is to identify these variables and deduce the variables used in training the target model. This attack is referred to as observed variable inference, as it seeks to infer the specific environmental variables observed by the target model.

4.2. Method

In this section, we propose a method to address the privacy inference problem by presenting a paradigm for the inference attack and a specific inference process for each type of partial observation. The entire attack framework is depicted in Figure 1. The observation space $O$ can be inferred from the observed behavioural trajectory of the target model. Given that the behavioural trajectory $\tau = \{s_0, a_0, s_1, a_1, \ldots, a_t\}$ generated by a policy model in the POMDP set displays distinct behavioural characteristics, we formulate the problem as classifying trajectories generated from different policy models. To obtain behavioural characteristics from different observation spaces, we require a set of models. A set of observation spaces, denoted as $O$, forms a POMDP set, which corresponds to a set of models $\pi = \{\pi_{o_1}, \pi_{o_2}, \ldots, \pi_{o_m}\}$. Different observations affect the transition probability function that the agent perceives. There are two types of partial observation: one in which one or more observation variables have different values, and one in which at least one observation variable is missing. While both types can be inferred in a similar manner, the trajectory data used for inference differ. In the following subsections, we examine the characteristics of each type.

Figure 1. The attack workflow of observed information inference.


4.2.1. Observed value inference

In this scenario, there are two observation spaces, $O_1$ and $O_2$, both containing the same state variables. However, the observed sets of state variables differ in the values of a subset of state variables, denoted $U_{\mathrm{diff}} = \{u_1, u_2, \ldots, u_n\}$. The attacker can observe the state and action of the target model and, knowing the composition of state variable types, can acquire the corresponding state-action pair data. This trajectory data takes the form $\tau = \{s_0, a_0, s_1, a_1, s_2, a_2, \ldots, s_m, a_m\}$. To perform inference, the attacker trains a trajectory classifier on the state-action pair data $D_{\tau_{sa}} = \{X, Y\}$, where $X = \{s_0, a_0, s_1, a_1, s_2, a_2, \ldots, s_m, a_m\}$ and $Y$ is the observation label. During the inference phase, the observed state-action pair trajectories of the target model are input directly into the classifier to obtain the inference result.
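As an illustration, the sketch below shows one way such state-action trajectories might be turned into feature sequences for the classifier; the `episode_to_features` helper and the array layout are hypothetical, not taken from the paper.

```python
import numpy as np

def episode_to_features(states, actions):
    """Interleave states and actions into the sequence
    (s_0, a_0, s_1, a_1, ..., s_m, a_m) used as classifier input.

    states:  list of state vectors
    actions: list of actions (scalars or vectors), one per state
    Returns an array of shape (num_steps, state_dim + action_dim).
    """
    rows = []
    for s, a in zip(states, actions):
        s = np.asarray(s, dtype=np.float32)
        a = np.atleast_1d(np.asarray(a, dtype=np.float32))
        rows.append(np.concatenate([s, a]))
    return np.stack(rows)

# Hypothetical usage: label each trajectory with the observation-space id
# of the model that produced it.
# X.append(episode_to_features(ep_states, ep_actions)); Y.append(obs_label)
```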

4.2.2. Observed variable inference

In this scenario, we have two observation spaces, $O_1$ and $O_2$, with different sets of state variables; each observation space has at least one distinct observation variable. Since the attacker has access only to the action data of the target model, they can obtain only the corresponding action data. This data takes the form of a trajectory $\tau = \{a_0, a_1, \ldots, a_m\}$. To train the trajectory classifier, we use the action data $D_{\tau_a} = \{X, Y\}$, where $X = \{a_0, a_1, \ldots, a_m\}$ and $Y$ represents the observation label. During the inference phase, the observed action trajectories of the target model are input directly into the classifier to obtain the inference result.
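For the action-only case, a corresponding (again hypothetical) sketch simply stacks the actions into a sequence:

```python
import numpy as np

def actions_to_features(actions):
    """Stack the action-only trajectory (a_0, a_1, ..., a_m) used for observed
    variable inference; each action becomes one row of the sequence."""
    return np.stack([np.atleast_1d(np.asarray(a, dtype=np.float32)) for a in actions])
```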

4.2.3. Attack model

To conduct the inference attack, the attacker must train a trajectory classifier that identifies the target model's observation space from its trajectory. This classifier is trained on a dataset of trajectories from different observation spaces. To obtain this dataset, the attacker trains surrogate models on the decision task and constructs different observation environments using the set of observation spaces $O = \{o_1, o_2, \ldots, o_m\}$. The resulting set of surrogate models is denoted $\pi_{\mathrm{surrogate}} = \{\pi_{o_1}, \pi_{o_2}, \ldots, \pi_{o_m}\}$, each of which generates a set of behavioural trajectories, with each trajectory beginning from a different initial state $s_0$. Because these models perform the same task in the same environment, the trajectory data must be sampled from the same environment. The attacker accomplishes this by using each surrogate model to sample $N$ trajectory sequences of length $H$ in a given environment. Repeating this process for every surrogate model yields a trajectory dataset $D_{\tau} = \{\tau_1, \tau_2, \ldots, \tau_m\}$, where each $\tau$ corresponds to a different observation space. After obtaining the trajectory dataset $D = \{x, y\}$, where $x$ is a trajectory $\tau = \{s_0, a_0, s_1, a_1, \ldots, a_t\}$ or $\tau = \{a_0, a_1, \ldots, a_m\}$ and $y$ is the corresponding observation space label, the attacker uses this dataset to train a trajectory classifier. Once the classifier is trained, the attacker collects the actual behaviour trajectory of the target model and inputs it into the classifier; the classifier's prediction indicates the observation space of the target model.
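A minimal sketch of the trajectory-collection loop described above is shown below. It assumes a Gymnasium-style `reset(seed=...)`/`step(...)` interface and surrogate policies exposing a `predict(state)` method that returns an action; both interfaces are assumptions, not the authors' code.

```python
def collect_trajectories(surrogate_policies, make_env, n_traj, horizon, seeds):
    """Sample labelled trajectories from each surrogate policy.

    surrogate_policies: one trained policy per candidate observation space
    make_env:           factory returning the shared task environment
    seeds:              environment seeds reused across policies so that the
                        sampled trajectories are comparable
    Returns a list of (trajectory, observation_space_label) pairs, where a
    trajectory is a list of (state, action) steps.
    """
    dataset = []
    for label, policy in enumerate(surrogate_policies):
        env = make_env()
        for i in range(n_traj):
            state, _ = env.reset(seed=int(seeds[i % len(seeds)]))
            trajectory = []
            for _ in range(horizon):
                action = policy.predict(state)          # assumed policy API
                trajectory.append((state, action))
                state, reward, terminated, truncated, _ = env.step(action)
                if terminated or truncated:
                    break
            dataset.append((trajectory, label))
    return dataset
```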

The loss function of the classifier is the cross-entropy loss: (1) $\mathrm{loss} = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$. The number of classes, denoted $M$, varies with the environment. The binary indicator $y_{o,c}$ takes the value 1 if class label $c$ is the correct classification for a given observation $o$, and 0 otherwise, and $p_{o,c}$ is the predicted probability that the observation belongs to class $c$.
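For concreteness, a minimal PyTorch sketch of an LSTM trajectory classifier trained with the softmax cross-entropy loss of Equation (1) is given below; the layer sizes, feature dimension and number of classes are placeholders rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """LSTM over a (batch, time, feature) trajectory; predicts the observation space."""
    def __init__(self, feature_dim, num_classes, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)       # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])        # logits over observation-space classes

# One training step with the loss of Equation (1) (softmax cross-entropy).
model = TrajectoryClassifier(feature_dim=10, num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 50, 10)              # 32 trajectories, 50 steps, 10 features each
y = torch.randint(0, 4, (32,))           # observation-space labels
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```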

5. Experiments

In this section, we present the results of our experiments demonstrating how an attacker could potentially acquire private information through variable inference using the shadow model in two scenarios. We conducted these experiments on a suite of continuous control benchmarks in the Gym and the PyBullet version of MuJoCo tasks. Our goal is to determine which variable a well-trained Deep Reinforcement Learning agent observed in two types of partial observation environments.

The first type involves different environment parameters: partial observation environments are obtained by fixing the value of an environment parameter, the state variables' values are samples from the resulting environment, and this defines its observation space. The second type of partial observation involves models trained with different compositions of variable types in the observation; we controlled the models' inputs to construct the different observation spaces.

We will proceed to explain how we constructed the partial observation environments in these two scenarios.

5.1. Environment

The method used to construct the environments in our study is based on a previous work by Packer et al. (Citation2018). In this approach, the environment parameters are fixed at predetermined values to create different observation environments. Here are some details regarding the construction of these environments.

Cartpole: The objective is to balance a pole on a cart by controlling the cart's movement along a track. Three parameters affect the behaviour of the system: the magnitude of the push force applied to the cart, the length of the pole, and the mass of the pole.

Mountain Car: The objective is to move a car uphill, and the environment can be adjusted by varying two parameters: push force and car mass.

Acrobot: The objective of this task is to control a bar with an actuator at the joint between its two links. The bar has three adjustable parameters: length, mass, and moment of inertia, which are the same for both links.

Pendulum: The objective is to maintain the vertical position of a pendulum by applying a continuous force. Two environmental parameters that can be manipulated are the pendulum's length and mass.

To generate new observation environments, a single parameter is altered at a time. For instance, in the Cartpole task, the pole length is adjusted to create observation environments: ten environments were created by varying the pole length from 0.1 to 1 while keeping default values for the other parameters. These values were then assigned as observation labels for use during the inference phase. The same approach was used to create observation environments for the other tasks, with the number of environments depending on the parameter. Table 1 provides an overview of the environmental settings.

Table 1. Different value environment.
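As an illustration of how such value environments might be built, the sketch below fixes the CartPole pole length via the `length` attribute of the classic Gym implementation; the attribute names and the need to refresh derived quantities may vary across Gym versions, so treat this as an assumption rather than the authors' exact setup.

```python
import gym
import numpy as np

def make_cartpole_with_pole_length(pole_length):
    """Create a CartPole-v1 instance with a fixed pole length; all other
    parameters keep their default values."""
    env = gym.make("CartPole-v1")
    env.unwrapped.length = pole_length
    # Refresh the derived quantity used by the dynamics (assumption: present
    # in the classic-control implementation under this name).
    env.unwrapped.polemass_length = env.unwrapped.masspole * pole_length
    return env

# Ten observation environments with pole lengths 0.1, 0.2, ..., 1.0;
# the pole length doubles as the observation label during inference.
pole_lengths = np.round(np.arange(0.1, 1.01, 0.1), 1)
envs = {length: make_cartpole_with_pole_length(length) for length in pole_lengths}
```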

In this study, observations with different variables were created by masking state variables. We used PyBullet, an open-source implementation of the MuJoCo environments based on the Bullet physics engine; the source code and environments were obtained from Ni et al. (Citation2021). Four different environments were used in the experiments.

Hopper: The objective of the Hopper environment is to make a one-legged robot hop forward as fast as possible in a two-dimensional space. The observation space of Hopper consists of 15 state variables.

HalfCheetah: The objective of the HalfCheetah environment is for the robot to learn to walk on a track without losing balance by applying continuous forces to its joints. The observation space of HalfCheetah consists of 26 state variables.

Walker: The objective of the Walker environment is to teach a two-dimensional bipedal robot to walk forward at the maximum possible speed. The observation space of Walker consists of 22 state variables.

Ant: The objective of the Ant environment is for a four-legged robot to learn to walk and navigate through its environment by applying continuous valued forces to its joints. The observation space of Ant consists of 28 state variables.

The objective of these environments is to learn how to walk and traverse the terrain. To create different observation states, a single variable is masked, resulting in a set of distinct observation states for each environment (as shown in Table 2). The position of the masked variable within the input vector serves as the observation label used during the inference phase.

Table 2. Different variable environment.
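A minimal sketch of how a single state variable might be masked to create these partial-observation environments is shown below, using a standard `gym.ObservationWrapper`; zero-filling the masked entry is an assumption, as the paper does not state how hidden variables are represented.

```python
import numpy as np
import gym

class MaskObservation(gym.ObservationWrapper):
    """Hide one state variable by zeroing its position in the observation
    vector; the masked index also serves as the observation label."""

    def __init__(self, env, masked_index):
        super().__init__(env)
        self.masked_index = masked_index

    def observation(self, obs):
        obs = np.array(obs, dtype=np.float32, copy=True)
        obs[self.masked_index] = 0.0   # assumption: masked entries are zero-filled
        return obs

# Hypothetical usage for a Hopper-style task with 15 state variables:
# envs = [MaskObservation(make_hopper_env(), i) for i in range(15)]
```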

5.2. Training attack model

Shadow models are trained in the partial observation environments to generate different trajectories; the number of models required depends on the number of observation spaces. For each observation environment, the best model after 3 million training steps is selected as its representative. Models are trained with the widely used PPO (Schulman et al., Citation2017) and SAC (Haarnoja et al., Citation2018) algorithms, using a multi-layer perceptron as the policy network. Each episode is run with a random environment seed, and the same seeds are used across the different observation spaces of the same task.

For observed value inference, state-action pairs are sampled, while for observed variable inference, action sequences are sampled. To ensure fairness, all models for a task sample data in the same environment. In this study, 32 random seeds are used to sample data in each environment, with eight seeds used for training the classifiers and the remaining seeds used for evaluation. In total, 10,000 trajectories are sampled and labelled with their corresponding observation space. An LSTM classifier is employed for both inference attacks, with the last layer of the classifier depending on the dataset and the number of observation spaces.
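A small sketch of the seed-based split described above (8 of the 32 environment seeds for training the classifier, the rest for evaluation); the concrete seed values and the triple layout are placeholders, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
all_seeds = np.arange(32)                 # 32 environment seeds per task
train_seeds = set(rng.choice(all_seeds, size=8, replace=False).tolist())

def split_by_seed(trajectories):
    """trajectories: list of (features, label, seed) triples collected as above."""
    train, evaluation = [], []
    for features, label, seed in trajectories:
        (train if seed in train_seeds else evaluation).append((features, label))
    return train, evaluation
```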

6. Result

The goal of observed value inference is to infer the fixed value of an observation variable, while observed variable inference targets the variables observed by the target model. The results presented in Table 3 show that the shadow-policy approach achieved high accuracy for variable inference across all tasks. This indicates that partial observation can significantly affect the behaviour of the model, which in turn allows an attacker to deduce which state variables were observed during the training of a given policy, posing a privacy-leakage challenge for existing reinforcement learning algorithms such as PPO and SAC.

Table 3. Inference accuracy.

To analyse the results of observed value inference, we can gain deeper insights by examining the performance of models trained in different observed value environments as well as the confusion matrices. The test-time performance of models trained on different observations of variable values is illustrated in Figure 2. Partial observations can result in diverse but comparable performance when executing the same task under different observations, indicating that reinforcement learning models possess a certain degree of transferability. However, inferring the specific observation variable values requires a careful analysis of the decision sequence. The confusion matrices shown in Figure 3 suggest that the method performs well overall but may exhibit bias in certain scenarios, such as the Mountain Car task across different environments. To improve performance, it may be worthwhile to incorporate additional data.

Figure 2. The reward of models trained on different observations of the Gym environments. (a) CartPole. (b) MountainCar. (c) Acrobot. (d) Pendulum.


Figure 3. The confusion matrices of observed value inference attack against the Gym environments. (a) CartPole. (b) MountainCar. (c) Acrobot. (d) Pendulum.


To analyse the results of observed variable inference, we employ the same approach as for observed value inference: evaluating the performance of models trained on different observed variables and examining the corresponding confusion matrices. The test-time performance of models trained on different observed variables is presented in Figure 4. The results indicate that variations in the observed variables can significantly affect a model's performance. This also suggests that an LSTM classifier can accurately identify observation labels based on these variations, and that different decision variables influence performance to varying degrees. The confusion matrices presented in Figure 5 indicate that the method performs well overall, although bias is observed in certain scenarios, such as the Hopper task across different environments. Adjusting the model's parameters or retraining it may be effective strategies for improving performance.

Figure 4. The reward of models trained on different observations of the MuJoCo environments. (a) Hopper. (b) Walker. (c) HalfCheetah. (d) Ant.


Figure 5. The confusion matrices of observed variable inference attack against the MuJoCo environments. (a) Hopper. (b) Walker. (c) HalfCheetah. (d) Ant.


7. Discussion

In this section, we will discuss the basis of two types of inference attacks and possible privacy protection methods for each of them.

In the scenario of observed value inference, the values of environmental variables give rise to distinct data distributions during interaction and sampling. This, in turn, leads to varying perceptions of the environment by the model, which encounters different data distributions, learns distinct probability transition matrices, and exhibits distinct planning behaviours. There are two main approaches to addressing this privacy concern. The first is to improve the model's generalisation by enriching the sampling distribution during training, so that the model's decision-making is more robust and less dependent on a specific distribution. The second is to employ privacy-preserving techniques such as differential privacy, which blur the relationship between the model's decisions and the data distribution. For example, in the Gym environments, training on observation spaces with different variable values may improve generalisation, yielding behaviour that adapts across distributions rather than decision characteristics tied to a specific distribution. Similarly, differential privacy makes it difficult to discern the distribution of the training data, so that decision characteristics become indistinguishable from one another.

In the scenario of observed variable inference, a model's decision-making can depend on which variables it observes during training. These variables introduce environmental information, and missing variables can lead to incorrect perceptions of the environment and to different decisions in certain states. If certain variables are absent, the model may become highly sensitive to minor changes in the remaining observed variables, resulting in anomalous decisions. For example, an autonomous vehicle may fail to detect certain obstacles because some variables are absent, resulting in noticeable differences in its decision-making. One possible countermeasure is to apply privacy protection measures or to introduce randomness into decision-making so as to reduce the salience of action features. In the MuJoCo environments, for instance, robots can make a wide range of motion choices during locomotion, making it difficult for an attacker to extract information about the observed variables; this can help protect against such attacks.

8. Conclusion

This study examines the resilience of deep reinforcement learning policies to privacy-leaking attacks. Two types of inference attack were proposed: observed value inference and observed variable inference. The attacks correspond to two settings: (1) constructing environments with different variable values and representing a model's characteristics using state-action pair data, and (2) assuming that different models have different variable inputs and proposing an algorithm that uses action sequences to infer which variables were used to train a given policy. The experiments demonstrated that deep RL models are vulnerable to privacy inference attacks and that specific information about the training observation can be accurately inferred.

Partial observation is a prevalent problem in reinforcement learning, and more research is needed to fully understand its impact on policy characteristics. Reinforcement learning algorithms can aid in understanding the environment even with incomplete information and may perform as well as full-observation approaches while avoiding privacy risks. One possible direction is to diversify the learned policies, so that trained RL policies depend on more features across different observations, making them more resistant to privacy attacks.

This research provides valuable insights into developing robust reinforcement learning algorithms that can defend against privacy-leaking attacks. Specifically, we propose an inference attack on the observation space of RL and demonstrate that the information contained therein is highly susceptible to such attacks. To address this vulnerability, the paper explores two types of partial observation attacks and presents methods for inferring private information about the observation from well-trained policies under different observations. These findings offer a new perspective on data bias and privacy in reinforcement learning and suggest further exploration of the RL observation space as future work. Overall, this work contributes to the development of RL algorithms that protect the privacy of the training observation and are more robust against privacy-leaking attacks.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was financially supported by Systems Engineering Research Institute of China State Shipbuilding Corporation (CSSC) [Grant No. 193-A11-107-01-33].

References

  • Ai, S., Hong, S., Zheng, X., Wang, Y., & Liu, X. (2021). CSRT rumor spreading model based on complex network. International Journal of Intelligent Systems, 36(5), 1903–1913. https://doi.org/10.1002/int.v36.5
  • Carlini, N., Liu, C., & Song, D. (2019). The secret sharer: Evaluating and testing unintended memorization in neural networks. In Usenix security symposium (Vol. 267). USENIX Association.
  • Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., & Dean, J. (2019). A guide to deep learning in healthcare. Nature Medicine, 25(1), 24–29. https://doi.org/10.1038/s41591-018-0316-z
  • Ganju, K., Wang, Q., Yang, W., Gunter, C. A., & Borisov, N. (2018). Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security (pp. 619–633). Association for Computing Machinery.
  • Goldstein, E. B. (2010). Encyclopedia of perception. Sage.
  • Gong, N. Z., & Liu, B. (2016). You are who you know and how you behave: Attribute inference attacks via users' social friends and behaviors. In Usenix security symposium (pp. 979–995). USENIX Association.
  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning (pp. 1861–1870). PMLR.
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). IEEE Computer Society.
  • Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32). AAAI Press.
  • Hou, R., Ai, S., Chen, Q., Yan, H., Huang, T., & Chen, K. (2022). Similarity-based integrity protection for deep learning systems. Information Sciences, 601, 255–267. https://doi.org/10.1016/j.ins.2022.04.003
  • Huang, S., Papernot, N., Goodfellow, I., Duan, Y., & Abbeel, P. (2017). Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
  • Li, Y., Yan, H., & Huang, T. (2022). Model architecture level privacy leakage in neural networks. Journal of Science China Information Sciences, 65(7), 1–14. https://www.sciengine.com/SCIS/doi/10.1007/s11432-022-3507-7
  • Liang, C., Miao, M., Ma, J., Yan, H., Zhang, Q., & Li, X. (2022). Detection of global positioning system spoofing attack on unmanned aerial vehicle system. Concurrency and Computation: Practice and Experience, 34(7), e5925. https://doi.org/10.1002/cpe.v34.7
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Lin, G., Yan, H., Kou, G., Huang, T., Peng, S., Zhang, Y., & Dong, C. (2022). Understanding adaptive gradient clipping in DP-SGD, empirically. International Journal of Intelligent Systems, 37(11), 9674–9700. https://doi.org/10.1002/int.v37.11
  • Lin, Y. C., Hong, Z. W., Liao, Y. H., Shih, M. L., Liu, M. Y., & Sun, M. (2017). Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748.
  • Liu, Z., Yang, Y., Miller, T., & Masters, P. (2021). Deceptive reinforcement learning for privacy-preserving planning. arXiv preprint arXiv:2102.03022.
  • Masters, P., & Sardina, S. (2019). Goal recognition for rational and irrational agents. In Proceedings of the 18th international conference on autonomous agents and multiagent systems (pp. 440–448). IFAAMAS.
  • Mo, K., Tang, W., Li, J., & Yuan, X. (2022). Attacking deep reinforcement learning with decoupled adversarial policy. IEEE Transactions on Dependable and Secure Computing, 20(1), 758–768. https://doi.org/10.1109/TDSC.2022.3143566
  • Ni, T., Eysenbach, B., & Salakhutdinov, R. (2021). Recurrent model-free RL is a strong baseline for many POMDPs. arXiv preprint arXiv:2110.05038.
  • Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., & Song, D. (2018). Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.
  • Pan, X., Wang, W., Zhang, X., Li, B., Yi, J., & Song, D. (2019). How you act tells a lot: Privacy-leakage attack on deep reinforcement learning. arXiv preprint arXiv:1904.11082.
  • Salem, A. M. G., Bhattacharyya, A., Backes, M., Fritz, M., & Zhang, Y. (2020). Updates-leak: Data set inference and reconstruction attacks in online learning. In 29th Usenix security symposium (pp. 1291–1308). USENIX Association.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP) (pp. 3–18). IEEE Computer Society.
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 6000–6010. https://dl.acm.org/doi/10.5555/3295222.3295349
  • Wang, B., & Hegde, N. (2019). Privacy-preserving q-learning with functional noise in continuous spaces. Advances in Neural Information Processing Systems, 32, 11327–11337. https://dl.acm.org/doi/10.5555/3454287.3455303
  • Yan, H., Hu, L., Xiang, X., Liu, Z., & Yuan, X. (2021). PPCL: Privacy-preserving collaborative learning for mitigating indirect information leakage. Information Sciences, 548, 423–437. https://doi.org/10.1016/j.ins.2020.09.064
  • Yan, H., Jiang, N., Li, K., Wang, Y., & Yang, G. (2021). Collusion-free for cloud verification toward the view of game theory. ACM Transactions on Internet Technology (TOIT), 22(2), 1–21. https://doi.org/10.1145/3423558