Q-learning

Closed. Posted 4 years ago. Paid on delivery.

The figure above shows the domain of a more complex game. There are 25 grid locations the agent could be in. A prize could be on one of the corners, or there could be no prize. When the agent lands on a prize, it receives a reward of 10 and the prize disappears. When there is no prize, at each time step there is a probability that a prize appears on one of the corners. Monsters can appear at any time on one of the locations marked M. The agent gets damaged if a monster appears on the square the agent is on. If the agent is already damaged, it receives a reward of -10. The agent can get repaired (i.e., so it is no longer damaged) by visiting the repair station marked R.

In this example, the state consists of four components: ⟨X,Y,P,D⟩, where X is the X-coordinate of the agent, Y is the Y-coordinate of the agent, P is the position of the prize (P=0 if there is a prize on P0, P=1 if there is a prize on P1, similarly for P2 and P3, and P=4 if there is no prize), and D is Boolean and is true when the agent is damaged. Because the monsters are transient, it is not necessary to include them as part of the state. There are thus 5×5×5×2 = 250 states. The environment is fully observable, so the agent knows what state it is in. But the agent does not know the meaning of the states; it has no idea initially about being damaged or what a prize is.
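
Since Q-learning will need a table indexed by state, it helps to map the four-component state to a single integer. A minimal sketch (the function name `state_index` and the component ordering are arbitrary choices, not part of the assignment):

```python
def state_index(x, y, p, d):
    """Map a state <X, Y, P, D> to a unique index in 0..249.

    x, y: agent coordinates, each in 0..4
    p:    prize position, 0..3 for corners P0..P3, 4 for "no prize"
    d:    True if the agent is damaged
    """
    return ((x * 5 + y) * 5 + p) * 2 + int(d)

# Example: the undamaged agent at (0, 0) with no prize maps to index 8.
assert state_index(0, 0, 4, False) == 8
```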

The agent has four actions: up, down, left, and right. These move the agent one step, usually in the direction indicated by the name, but sometimes in one of the other directions. If the agent crashes into an outside wall or one of the interior walls (the thick lines near the location R), it remains where it was and receives a reward of -1.
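
The posting does not say how often a move slips into another direction, so any implementation must pick a probability. A sketch of the stochastic move, assuming a 0.3 slip chance split evenly over the other three directions (that number is an assumption, not from the text):

```python
import random

# (dx, dy) offsets per action name.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def sample_move(action, slip=0.3):
    """Return the (dx, dy) offset actually executed.

    With probability 1 - slip the intended action happens; otherwise one of
    the other three directions is chosen uniformly. The slip value is an
    assumed parameter -- the text only says moves "sometimes" go elsewhere.
    """
    if random.random() < slip:
        action = random.choice([a for a in MOVES if a != action])
    return MOVES[action]
```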

The agent does not know any of the story given here. It just knows that there are 250 states and 4 actions, which state it is in at each time step, and what reward it received each time. You need to:

(i) Build a simulator that replicates the above behaviour of the agent moving in the grid world (a sketch follows below).

(ii) Then, use Q-learning on the simulator built in (i) to learn the best policy for the agent to move in this environment (see the Q-learning sketch below).
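
A minimal simulator sketch for (i). The figure is not reproduced here, so the corner coordinates, monster squares, repair square, and interior walls below are placeholders to be adjusted against the actual figure; the prize-appearance and monster-appearance probabilities are likewise assumptions:

```python
import random

class MonsterGame:
    """Simulator for the 5x5 prize/monster grid world of the assignment.

    Corner, monster, and repair placements are placeholders (the figure is
    needed to fix them), and the event probabilities are assumed values.
    """

    CORNERS = [(0, 0), (4, 0), (0, 4), (4, 4)]      # P0..P3 (assumed order)
    MONSTERS = [(1, 1), (3, 1), (2, 3)]             # squares marked M (placeholder)
    REPAIR = (2, 2)                                 # square marked R (placeholder)
    MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

    def __init__(self, slip=0.3, p_prize=0.1, p_monster=0.2):
        self.slip = slip            # chance the move goes in another direction
        self.p_prize = p_prize      # per-step chance a prize appears (assumed)
        self.p_monster = p_monster  # per-step chance a monster appears (assumed)
        self.reset()

    def reset(self):
        self.x, self.y = 2, 2       # arbitrary start square
        self.prize = 4              # 4 means "no prize"
        self.damaged = False
        return self.state()

    def state(self):
        # Same encoding as state_index above: an integer in 0..249.
        return ((self.x * 5 + self.y) * 5 + self.prize) * 2 + int(self.damaged)

    def step(self, action):
        reward = 0
        # Stochastic move, as in the sample_move sketch.
        if random.random() < self.slip:
            action = random.choice([a for a in self.MOVES if a != action])
        dx, dy = self.MOVES[action]
        nx, ny = self.x + dx, self.y + dy
        if 0 <= nx < 5 and 0 <= ny < 5:   # interior walls omitted in this sketch
            self.x, self.y = nx, ny
        else:
            reward -= 1                   # crashed into an outside wall
        # A monster may appear on one of the M squares; if it appears on the
        # agent's square, the agent is damaged (-10 if already damaged).
        if random.random() < self.p_monster:
            if random.choice(self.MONSTERS) == (self.x, self.y):
                if self.damaged:
                    reward -= 10
                self.damaged = True
        # Visiting the repair station removes damage.
        if (self.x, self.y) == self.REPAIR:
            self.damaged = False
        # Landing on the prize collects it; otherwise a new one may appear.
        if self.prize < 4 and (self.x, self.y) == self.CORNERS[self.prize]:
            reward += 10
            self.prize = 4
        elif self.prize == 4 and random.random() < self.p_prize:
            self.prize = random.randrange(4)
        return self.state(), reward
```

The `state()` method uses the same encoding as `state_index` above, so the 250 state indices line up directly with a 250-row Q-table.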
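
For (ii), a tabular Q-learning sketch against that interface. The learning rate, discount factor, exploration rate, and step budget are assumed hyperparameters; because the game never ends, learning runs as a single continuing task for a fixed number of steps:

```python
import random

def q_learning(env, n_states=250, n_actions=4,
               alpha=0.2, gamma=0.9, epsilon=0.1, steps=500_000):
    """Tabular Q-learning on a continuing task.

    env must provide reset() -> state and step(action) -> (state, reward),
    as in the MonsterGame sketch above. All hyperparameters are assumed.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r = env.step(a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    # Greedy policy: the best action index for each state.
    policy = [max(range(n_actions), key=lambda a: Q[st][a])
              for st in range(n_states)]
    return Q, policy

# Usage with the simulator sketched above:
# Q, policy = q_learning(MonsterGame())
```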

AI (Artificial Intelligence) HW/SW

Project no: #22303611

About the project

Remote project. Active 4 years ago.