Can Simulation Entice Cooperation in Non-Zero-Sum Games?
In honor of Joe Halpern's 70th birthday, Vincent Conitzer posted this puzzle a year ago: https://www.cs.cmu.edu/~conitzer/halpernbirthday.pdf
Can you solve it?
We provide a solution for Game 1.
Let us denote the probability that Player B plays 'left' by p_B, and the probability that a simulated action of Player B is presented to Player A by s in [0,1]. Player A's strategy is a triple (p_0, p_1, p_2): she plays her first action with probability p_0 when no simulation is shown, with probability p_1 when the simulation shows Player B playing 'left', and with probability p_2 when it shows him playing 'right'. Writing p_A for the overall probability that Player A plays her first action, and B_ij for Player B's reward when Player A plays her i-th action and Player B his j-th (indices starting at 0, with action 0 for Player B being 'left'), we can formalize Player B's expected reward as follows:

E[R_B] = p_A p_B B_00 + p_A (1 - p_B) B_01 + (1 - p_A) p_B B_10 + (1 - p_A) (1 - p_B) B_11.     (1)

Based on the strategy (p_0, p_1, p_2), we can determine

p_A = s (p_B p_1 + (1 - p_B) p_2) + (1 - s) p_0 = s (p_B q + p_2) + (1 - s) p_0,   where q = p_1 - p_2.
In other words, the strategy for Player A has become a function of the strategy of Player B due to the occasional simulation.
Using Equation 1, and setting

x = B_00 + B_11 - B_01 - B_10,   y = B_01 - B_11,   z = B_10 - B_11,

we get

E[R_B] = B_11 + y (s p_2 + (1 - s) p_0) + (s y q + s x p_2 + (1 - s) x p_0 + z) p_B + (s x q) p_B^2.
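For readers who want to double-check this expansion, here is a quick symbolic verification. It is an aside we add here for illustration only; it uses sympy, which is not needed anywhere else in this article.

import sympy as sp

pB, p0, p1, p2, s = sp.symbols("p_B p_0 p_1 p_2 s")
B00, B01, B10, B11 = sp.symbols("B00 B01 B10 B11")

# Player A's overall probability of playing her first action.
q = p1 - p2
pA = s * (pB * q + p2) + (1 - s) * p0

# Player B's expected reward, Equation (1), with p_A substituted in.
reward_B = (pA * pB * B00 + pA * (1 - pB) * B01 +
            (1 - pA) * pB * B10 + (1 - pA) * (1 - pB) * B11)

# The quadratic-in-p_B form derived above.
x = B00 + B11 - B01 - B10
y = B01 - B11
z = B10 - B11
expanded = (B11 + y * (s * p2 + (1 - s) * p0) +
            (s * y * q + s * x * p2 + (1 - s) * x * p0 + z) * pB +
            s * x * q * pB**2)

print(sp.expand(reward_B - expanded) == 0)   # True: the two forms agree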
Therefore, given Player A's strategy (p_0, p_1, p_2), Player B should choose p_B such that the above quadratic expression is maximized. In Game 1, Player B's reward matrix has equal sums on its two diagonals, so x = 0: the quadratic coefficient s x q vanishes, the expected reward is linear in p_B with slope s y q + z, and Player B's best response is the pure strategy p_B = 1 whenever that slope is positive (and p_B = 0 whenever it is negative). With Game 1's payoffs and simulation probability, this condition reduces to the following:
As long as p_1 - p_2 = q > 2/3, Player B will want to play the (pure) strategy p_B = 1, which means Player B will always play 'left.'
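To make this concrete, here is a small plain-Python sketch we add for illustration (it is not part of Conitzer's puzzle or of the Seeker model below). Given a fixed strategy (p_0, p_1, p_2) for Player A, it computes Player B's best response p_B from the coefficients derived above, covering both the linear case discussed here and the general quadratic case. The function name best_response_B and the example payoff matrix are purely illustrative.

def best_response_B(s, B, p0, p1, p2):
    """Player B's optimal probability of playing 'left' against (p0, p1, p2)."""
    z = B[1][0] - B[1][1]
    y = B[0][1] - B[1][1]
    x = B[0][0] + B[1][1] - B[0][1] - B[1][0]
    q = p1 - p2
    a = s * x * q                                        # quadratic coefficient
    b = s * y * q + s * x * p2 + (1 - s) * x * p0 + z    # linear coefficient
    if a < 0:                                            # concave: interior optimum possible
        return min(1.0, max(0.0, -b / (2 * a)))
    return 1.0 if a + b > 0 else 0.0                     # linear/convex: an extreme is best

# Illustrative prisoner's-dilemma-style payoffs for Player B (not the puzzle's):
B_example = [[2, 3], [-1, 0]]                            # equal diagonal sums, so x = 0
print(best_response_B(0.5, B_example, p0=0.0, p1=1.0, p2=0.0))   # q = 1.0 -> 1.0
print(best_response_B(0.5, B_example, p0=0.0, p1=0.5, p2=0.0))   # q = 0.5 -> 0.0

With x = 0, the slope of Player B's expected reward in p_B is s y q + z, so whether he cooperates depends only on how strongly Player A conditions her play on the simulation, i.e. on q.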
Practical Gameplay
In the game above, we found that the reward matrix of the simulated player had equal sums on its two diagonals, leading to x = 0, which turned the quadratic optimization problem for Player B into a linear one. Now, consider a game where this is not the case (each entry gives the rewards of Player A and Player B, respectively):

                     B: 'left'    B: 'right'
A: first action      (3, 5)       (0, 9)
A: second action     (-1, 0)      (2, 2)
This time, we will simulate an action of Player B with probability 0.7.
a) Can you devise a strategy for Player A? Is there an equilibrium whose expected rewards are strictly greater than (2, 2) for both players?
b) Consider a variant where Player B continues to want to maximize his expected reward, while Player A also wants to manage her risk of receiving a negative reward. In particular, what strategy should Player A play if she wants to keep her expected reward at 2.09 or above while minimizing her chances of running a loss?
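Before turning to a solver, it can help to evaluate candidate strategies by hand. The following short sketch (again our own addition; the helper name expected_rewards does not appear in the original program) computes both players' expected rewards in this game for a given strategy (p_0, p_1, p_2) of Player A and a given p_B, using the same formula for p_A as above:

def expected_rewards(s, A, B, p0, p1, p2, pB):
    # Overall probability that Player A plays her first action.
    pA = s * (pB * (p1 - p2) + p2) + (1 - s) * p0
    # Joint probabilities of the four outcomes of the game.
    probs = [[pA * pB, pA * (1 - pB)],
             [(1 - pA) * pB, (1 - pA) * (1 - pB)]]
    ra = sum(probs[i][j] * A[i][j] for i in range(2) for j in range(2))
    rb = sum(probs[i][j] * B[i][j] for i in range(2) for j in range(2))
    return ra, rb

A = [[3, 0], [-1, 2]]
B = [[5, 9], [0, 2]]
# Baseline without any conditioning on the simulation: both players
# stick to their second action and end up at the (2, 2) outcome.
print(expected_rewards(0.7, A, B, p0=0.0, p1=0.0, p2=0.0, pB=0.0))   # (2.0, 2.0)

Combined with a best-response computation for Player B (as in the earlier sketch), this lets you score any candidate strategy for question a); for question b), one would additionally track the probability mass on the outcomes that give Player A a negative reward.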
Answering these questions analytically is very demanding. In practice, we may therefore resort to a tool for optimizing our strategy. The stochastic optimizer InsideOpt Seeker, for example, can answer both questions with the following simple program:
import seeker as skr

def halpern(s, A, B):
    # Coefficients of Player B's expected reward, which is quadratic in p_B
    # (see the derivation above): E[R_B] = const + linear_coeff * p_B + quadratic_coeff * p_B^2.
    z = B[1][0] - B[1][1]
    y = B[0][1] - B[1][1]
    x = B[0][0] - B[1][0] - y
    env = skr.Env("license.sio", stochastic=True)
    env.set_restart_likelihood(0.05)
    # Player A's strategy (p_0, p_1, p_2) is what we optimize.
    p = [env.continuous(0, 1) for _ in range(3)]
    q = p[1] - p[2]
    quadratic_coeff = s * x * q
    mixed_case = quadratic_coeff < 0
    linear_coeff = (s * y * q + (s * x) * p[2] +
                    ((1 - s) * x) * p[0] + z)
    # Player B best-responds: an interior optimum if the quadratic is concave,
    # otherwise one of the extremes 0 or 1.
    b_mixed = env.max([env.convert(0),
                       env.min([env.convert(1),
                                -linear_coeff / (2 * quadratic_coeff)])])
    b_extreme = env.if_(quadratic_coeff + linear_coeff > 0, 1, 0)
    p_B = env.if_(mixed_case, b_mixed, b_extreme)
    # Player A's overall probability of playing her first action.
    p_A = s * (p_B * q + p[2]) + (1 - s) * p[0]
    # Simulate one play of the game and look up both players' rewards.
    draw = [env.continuous_uniform(0, 1) for _ in range(2)]
    plays = [env.if_(draw[0] < p_A, 0, 1),
             env.if_(draw[1] < p_B, 0, 1)]
    reward_A = env.index([plays[0], plays[1]], env.tensor(A))
    reward_B = env.index([plays[0], plays[1]], env.tensor(B))
    exp_reward_A = env.aggregate_mean(reward_A)
    # maximize reward for Player A for 60 seconds
    env.maximize(exp_reward_A, 60)
    exp_reward_B = env.aggregate_mean(reward_B)
    prob_A = env.aggregate_relative_frequency_geq(reward_A, 0)
    env.evaluate()
    print("Maximizing Expected Profit")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(),
          [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())
    print("Evaluations", env.get_number_evaluations())
    print()
    # constrain the expected profit
    env.enforce_geq(exp_reward_A, 2.09)
    # maximize probability to achieve at least profit 0
    env.maximize(prob_A, 60)
    print("Maximizing Chance of Profit>=0 (exp profit>=2.09)")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(),
          [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())

if __name__ == "__main__":
    A = [[3, 0], [-1, 2]]
    B = [[5, 9], [0, 2]]
    halpern(0.7, A, B)
Running this program, we obtain the following results.
When simply maximizing Player A's expected profit (with Player B best-responding to maximize his own), we find that the solver chooses to set p_0 = 0.877. As x = -2 in this game, lowering p_0 is an interesting option for getting Player B to cooperate more: we see that he plays 'left' with probability 85.66%. The resulting equilibrium gives expected rewards of (2.1055, 4.8377).
When we instead minimize Player A's risk of receiving a negative reward (while enforcing an expected reward of at least 2.09), we find that Seeker increases p_0 to 1, which makes it beneficial for Player B to defect more often. As a result, Player A's expected reward declines slightly, but her risk of running a loss is reduced by 1.2%.
Conclusion
As you see, InsideOpt Seeker allows us to devise optimized strategies in competitive environments, even in the face of stochasticity and non-linearities. After all, not everyone is lucky enough to employ a Joe Halpern!
Happy Birthday, Joe!