Can Simulation Entice Cooperation in Non-Zero-Sum Games?


In honor of Joe Halpern's 70th birthday, Vincent Conitzer posted this puzzle a year ago: https://www.cs.cmu.edu/~conitzer/halpernbirthday.pdf

Can you solve it?



We provide a solution for Game 1.

Let us denote by p_B the probability that Player B plays 'left,' and by s in [0,1] the probability that a simulated sample of Player B's action is presented to Player A. Then we can formalize the expected outcomes and rewards as follows:

  • The probability p_A that Player A plays 'top' is determined by three other probabilities p_0, p_1, p_2 in [0,1], which Player A must set as her strategy: p_0 is the probability of playing 'top' when no simulation takes place; p_1 is the probability of playing 'top' when simulation took place and Player B's sampled action is 'left'; and p_2 is the probability of playing 'top' when simulation took place and Player B's sampled action is 'right.'

Based on the strategy (p_0, p_1, p_2), we can determine

p_A = s * (p_B * p_1 + (1 - p_B) * p_2) + (1 - s) * p_0.     (Equation 1)

In other words, the strategy of Player A has become a function of the strategy of Player B due to the occasional simulation.

  • Denote by B in R^{2 x 2} the reward matrix for Player B, where B_ij is Player B's reward when Player A plays her i-th action (row 1 = 'top') and Player B plays his j-th action (column 1 = 'left'). The expected reward for Player B then is

E[R_B] = p_A * p_B * B_11 + p_A * (1 - p_B) * B_12 + (1 - p_A) * p_B * B_21 + (1 - p_A) * (1 - p_B) * B_22.

Using Equation 1, and setting

x = B_11 - B_12 - B_21 + B_22,   y = B_12 - B_22,   z = B_21 - B_22,   q = p_1 - p_2,

we get

E[R_B] = (s * x * q) * p_B^2 + (s * y * q + s * x * p_2 + (1 - s) * x * p_0 + z) * p_B + ((1 - s) * y * p_0 + s * y * p_2 + B_22).

Therefore, given Player A's strategy (p_0, p_1, p_2), Player B should choose p_B such that this quadratic expression in p_B is maximized. (A small numerical check of this expansion is sketched right after this list.)

  • In our game, we have

and therefore

As long as q = p_1 - p_2 > 2/3, Player B will want to play the (pure) strategy p_B = 1, which means Player B will always play 'left.'

  • Consider the strategy (p_0 = 1, p_1 = 1, p_2 = 0) for Player A, which implies q = 1 and therefore p_B = 1. Since Player B now always plays 'left,' Player A will always play 'top' (p_2 has become immaterial, since no simulated sample will ever be 'right'). At the same time, this is the optimal outcome for Player A, since no other cell offers her a greater reward than top-left. Therefore, Player A has no incentive to change her strategy, and, as we have seen, given this strategy for Player A, Player B has no incentive to change his strategy either. We have therefore found an equilibrium with rewards (2, 3). Simulation has led to cooperation in this example.
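To make the algebra above concrete, here is a minimal plain-Python sketch (independent of Seeker) that evaluates Player B's expected reward both directly from the bimatrix and via the quadratic form in p_B, and then computes Player B's best response. The reward matrix, strategy, and simulation probability used here are purely illustrative assumptions and are not the data of Game 1.

# Sanity check of the quadratic expansion of Player B's expected reward
# and of Player B's best response. Plain Python, no Seeker required.
# The matrix B below is purely illustrative (NOT the matrix of Game 1).

def prob_top(p0, p1, p2, pB, s):
    # Equation 1: probability that Player A plays 'top'
    return s * (pB * p1 + (1 - pB) * p2) + (1 - s) * p0

def expected_reward_B_direct(B, pA, pB):
    # Player B's expected reward computed straight from the 2x2 bimatrix
    return (pA * pB * B[0][0] + pA * (1 - pB) * B[0][1]
            + (1 - pA) * pB * B[1][0] + (1 - pA) * (1 - pB) * B[1][1])

def quadratic_coeffs(B, p0, p1, p2, s):
    # Coefficients a, b, c of a*p_B^2 + b*p_B + c as derived in the text
    z = B[1][0] - B[1][1]
    y = B[0][1] - B[1][1]
    x = B[0][0] - B[1][0] - y
    q = p1 - p2
    a = s * x * q
    b = s * y * q + s * x * p2 + (1 - s) * x * p0 + z
    c = s * y * p2 + (1 - s) * y * p0 + B[1][1]
    return a, b, c

def best_response_B(a, b):
    # Maximize a*p_B^2 + b*p_B over p_B in [0, 1]
    if a < 0:                          # concave: clamp the vertex into [0, 1]
        return min(1.0, max(0.0, -b / (2 * a)))
    return 1.0 if a + b > 0 else 0.0   # otherwise an endpoint is optimal

if __name__ == "__main__":
    B = [[4, 1], [0, 3]]               # illustrative reward matrix for Player B
    p0, p1, p2, s = 0.9, 1.0, 0.2, 0.5
    a, b, c = quadratic_coeffs(B, p0, p1, p2, s)
    for pB in (0.0, 0.25, 0.5, 0.75, 1.0):
        direct = expected_reward_B_direct(B, prob_top(p0, p1, p2, pB, s), pB)
        assert abs(direct - (a * pB**2 + b * pB + c)) < 1e-12
    print("expansion verified; best response p_B =", best_response_B(a, b))

The case distinction in best_response_B mirrors the mixed_case / b_extreme logic in the Seeker program further below.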


Practical Gameplay

In the game above, we found that the reward matrix of the simulated player had equal sums on its main diagonal and anti-diagonal, leading to x = 0, which turned Player B's quadratic optimization problem into a linear one. Now consider a game where this is not the case. The reward matrices below are the ones used in the program further down; rows are Player A's actions ('top'/'bottom'), columns are Player B's actions ('left'/'right'), and each cell lists Player A's reward first and Player B's second:

              left      right
  top        (3, 5)    (0, 9)
  bottom    (-1, 0)    (2, 2)

This time, we will simulate an action of Player B with probability s = 0.7.
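As a quick worked check, plugging Player B's reward matrix into the definitions above gives

x = 5 - 9 - 0 + 2 = -2,   y = 9 - 2 = 7,   z = 0 - 2 = -2,

so the quadratic term in p_B no longer vanishes.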

a) Can you devise a strategy for Player A? Is there an equilibrium whose rewards are strictly greater than (2, 2) for both players?

b) Consider a variant where Player B still wants to maximize his expected reward, while Player A also wants to manage her risk of receiving a negative reward. In particular, what strategy should Player A play if she wants to keep her expected reward above 2.09 while minimizing her chances of incurring a loss?


Answering these questions analytically is very demanding. In practice, we may therefore resort to a tool for optimizing our strategy. The stochastic optimizer InsideOpt Seeker, for example, can answer both questions with the following short program:

import seeker as skr

def halpern(s, A, B):
    # Coefficients of Player B's expected reward (see the derivation above):
    # E[R_B] = (s*x*q) p_B^2 + (s*y*q + s*x*p_2 + (1-s)*x*p_0 + z) p_B + const
    z = B[1][0] - B[1][1]
    y = B[0][1] - B[1][1]
    x = B[0][0] - B[1][0] - y

    env = skr.Env("license.sio", stochastic=True)
    env.set_restart_likelihood(0.05)
    # Player A's strategy (p_0, p_1, p_2)
    p = [env.continuous(0, 1) for _ in range(3)]
    q = p[1] - p[2]
    # Player B best-responds by maximizing the quadratic in p_B
    quadratic_coeff = s * x * q
    mixed_case = quadratic_coeff < 0
    linear_coeff = (s * y * q + (s * x) * p[2] +
                    ((1 - s) * x) * p[0] + z)
    # Concave case: clamp the vertex -b/(2a) into [0, 1]
    b_mixed = env.max([env.convert(0), env.min([env.convert(1),
                       -linear_coeff /
                       (2 * quadratic_coeff)])])
    # Convex or linear case: the maximum is attained at p_B = 0 or p_B = 1
    b_extreme = env.if_(quadratic_coeff + linear_coeff > 0,
                        1, 0)
    p_B = env.if_(mixed_case, b_mixed, b_extreme)
    # Equation 1: probability that Player A plays 'top'
    p_A = s * (p_B * q + p[2]) + (1 - s) * p[0]
    # Sample one playthrough (0 = 'top'/'left', 1 = 'bottom'/'right')
    draw = [env.continuous_uniform(0, 1) for _ in range(2)]
    plays = [env.if_(draw[0] < p_A, 0, 1),
             env.if_(draw[1] < p_B, 0, 1)]
    reward_A = env.index([plays[0], plays[1]], env.tensor(A))
    reward_B = env.index([plays[0], plays[1]], env.tensor(B))
    exp_reward_A = env.aggregate_mean(reward_A)

    # maximize reward for Player A for 60 seconds
    env.maximize(exp_reward_A, 60)
    exp_reward_B = env.aggregate_mean(reward_B)
    prob_A = env.aggregate_relative_frequency_geq(reward_A, 0)
    env.evaluate()
    print("Maximizing Expected Profit")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(), 
              [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())
    print("Evaluations", env.get_number_evaluations())
    print()

    # constrain the expected profit
    env.enforce_geq(exp_reward_A, 2.09)
    # maximize probability to achieve at least profit 0
    env.maximize(prob_A, 60)
    print("Maximizing Chance of Profit>=0 (exp profit>=2.09)")
    print("Prob A >= 0:", prob_A.get_value())
    print("Reward A:", exp_reward_A.get_value())
    print("Reward B:", exp_reward_B.get_value())
    print("Strategy A:", p_A.get_value(), 
              [p[i].get_value() for i in range(3)])
    print("Strategy B:", p_B.get_value())


if __name__ == "__main__":
    A = [[3, 0], [-1, 2]]
    B = [[5, 9], [0, 2]]
    halpern(0.7, A, B)

As a result, we get:

In the case of the plain maximization of Player A's expected reward (with Player B best-responding), we find that the solver chooses to set p_0 = 0.877. Since x = -2 in this game, lowering p_0 is an interesting lever for getting Player B to cooperate more: we see that he plays 'left' with probability 85.66%. The equilibrium is reached at rewards (2.1055, 4.8377).

[Figure: Seeker-provided reward densities for Player A after the risk minimization.]


In the case where we minimize Player A's risk of a negative reward, we find that Seeker increases p_0 to 1, which makes it beneficial for Player B to defect more often. As a result, the expected reward for Player A declines slightly, but her risk of incurring a loss is reduced by 1.2%.
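As a rough cross-check of these numbers, one can remove the sampling noise entirely and evaluate the closed-form expressions over a grid of Player A's strategies in plain Python. The sketch below is not part of the Seeker program; it merely reuses the best-response formula derived earlier, so its results should land close to, but not exactly on, the sampled values reported above.

# Deterministic grid search over Player A's strategies (p_0, p_1, p_2),
# with Player B best-responding. Plain Python; reuses the formulas derived
# in the text. Results only approximate the sampled values Seeker reports.

import itertools

A = [[3, 0], [-1, 2]]   # Player A's rewards (rows: top/bottom, cols: left/right)
B = [[5, 9], [0, 2]]    # Player B's rewards
s = 0.7                 # simulation probability

z = B[1][0] - B[1][1]
y = B[0][1] - B[1][1]
x = B[0][0] - B[1][0] - y

def outcome(p0, p1, p2):
    """Expected reward for A and A's loss probability when B best-responds."""
    q = p1 - p2
    a = s * x * q
    b = s * y * q + s * x * p2 + (1 - s) * x * p0 + z
    if a < 0:                                  # concave: interior best response
        pB = min(1.0, max(0.0, -b / (2 * a)))
    else:                                      # otherwise an endpoint is optimal
        pB = 1.0 if a + b > 0 else 0.0
    pA = s * (pB * q + p2) + (1 - s) * p0      # Equation 1
    exp_A = (pA * pB * A[0][0] + pA * (1 - pB) * A[0][1]
             + (1 - pA) * pB * A[1][0] + (1 - pA) * (1 - pB) * A[1][1])
    prob_loss = (1 - pA) * pB                  # only bottom-left is negative for A
    return exp_A, prob_loss

grid = [i / 50 for i in range(51)]             # step 0.02
results = [(outcome(p0, p1, p2), (p0, p1, p2))
           for p0, p1, p2 in itertools.product(grid, repeat=3)]

# Question a): maximize Player A's expected reward
print("a)", max(results, key=lambda r: r[0][0]))

# Question b): minimize loss probability subject to expected reward >= 2.09
feasible = [r for r in results if r[0][0] >= 2.09]
print("b)", min(feasible, key=lambda r: r[0][1]))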


Conclusion

As you can see, InsideOpt Seeker allows us to devise optimized strategies in competitive environments, even in the face of stochasticity and non-linearities. After all, not everyone is lucky enough to employ a Joe Halpern!

Happy Birthday, Joe!
