Artificial Intelligence for Business

Reinforcement Learning

Reinforcement Learning (RL), a versatile and dynamic subset of Artificial Intelligence, is increasingly prominent across a spectrum of industries, demonstrating its broad adaptability and immense potential. In the world of finance, RL algorithms are critical for fine-tuning strategies in areas such as algorithmic trading and portfolio management. The healthcare sector benefits from RL's capability to personalize treatment plans, tailoring them to meet the unique needs of individual patients. In the field of robotics, RL plays a pivotal role in teaching machines to perform complex tasks, such as navigation, through a process of trial and error. Furthermore, the gaming industry leverages RL to create AI capable of mastering strategic games, providing profound insights into complex decision-making.

In the emerging arena of Large Language Models (LLMs), the application of Reinforcement Learning via Human Feedback (RLHF) is particularly significant. RLHF involves the iterative refinement of LLMs through human input, substantially enhancing their capacity to generate responses that are not only more relevant and accurate but also finely attuned to context and human preferences.

In this seven-week course, you will delve into the core principles of Reinforcement Learning (RL) techniques, gaining the expertise to craft intelligent agents adept at addressing real-world business challenges across diverse domains. This journey will equip you with a skill set that is not only immediately applicable to a myriad of business scenarios but also serves as a solid foundation for advanced studies. By its conclusion, you will be well-prepared to innovate and implement solutions that can significantly contribute to business growth and success, leveraging the transformative power of RL in the business sphere.

See Course Spotlight

Student feedback: [2025][2024]

Announcements

The course starts on March 25, 2025. Look forward to seeing you in class!

A. Course logistics

We designed this course with specific policies to foster a productive and engaging educational atmosphere. We encourage you to carefully review these guidelines, which are fundamental to our collective learning experience.  

A.1. Teaching modality: This course is conducted exclusively in person; there will be no live streaming or recording of the lectures. To foster a dynamic, participatory classroom and provide a safe and inclusive space, we want every student to be able to speak freely, knowing that discussions remain within our classroom. For these reasons, we consciously opt not to record our sessions.

A.1.2. Recording policy: To uphold the privacy and intellectual property rights of everyone in our course, we strictly forbid any form of digital recording during our sessions, including video, audio, screenshots, and photographs. Recording is prohibited under any circumstances, and storing recorded material from our lectures can expose you to significant liability.

A.1.3 Attendance Policy: Regular attendance is essential for success, as sessions build on prior material and require active participation. Students must attend all classes and arrive on time.

  • Absences: Absences should be rare and limited to legitimate reasons (e.g., illness, emergencies, religious observances). Notify the instructor promptly, ideally before class, and submit a plan to complete missed work within one week, subject to approval. More than one unexcused absence requires a meeting with the instructor to continue attending. Excessive absences (over 10% of sessions) will automatically result in a failing grade (F).
  • Punctuality: Arriving over 10 minutes late counts as a partial absence; three late arrivals equal one absence.
  • Accommodations: Students with ongoing challenges (e.g., disabilities, family obligations) should contact the instructor early to discuss accommodations, per NYU Academic Affairs policies.

A.2. Communication with the Course Staff: We highly encourage you to visit us during our open-door office hours for any queries or discussions, as this is the most effective way to communicate with the course staff.

A.2.1. Email Communication: Timely and detailed email responses may not always be possible. To ensure your questions are addressed efficiently and thoroughly, we strongly encourage you to bring them to office hours or ask during class when appropriate.

A.2.2. Office hours:  To view our available time slots, please refer to the right column of our course website.

A.2.3. Confidential matters: If you need a confidential discussion about course content or logistics, please schedule a time slot with one of the course staff members. Please also refer to the 'Exceptional circumstances' section below, which offers guidance on how to approach such conversations. Our discussions will be strictly limited to course-related topics and logistical concerns, and we will respectfully decline to engage in conversations extending beyond these subjects to maintain the professional and focused nature of our interactions.

A.2.4. Required notice: Please plan and provide us with sufficient notice for your requests. It is essential to understand that we may be unable to address immediate needs arising from last-minute planning. 

A.2.5. Exceptional circumstances policy: If you have a confidential matter to discuss with the course staff, please be aware that the course staff cannot assess personal issues or special requests due to exceptional circumstances, such as illness, family issues, etc. When pursuing such matters or seeking an exemption, your initial point of contact should be the Academic Affairs Office. It is necessary to provide them with the relevant documentation about your situation. The office will thoroughly review your case, considering the specifics of your circumstances. Following their evaluation, the Academic Affairs Office will communicate directly with the course staff, informing us whether an exception is warranted and offering guidance on the appropriate actions. This process ensures that all requests are considered fairly and consistently, per the university's policies and procedures. For detailed information and guidance, students are encouraged to refer to the university's policy for the undergraduate program here. This policy provides comprehensive guidelines for managing exceptional circumstances within the university framework, ensuring transparency and fairness in all decisions.

B. Course contents

This course introduces the foundations of Reinforcement Learning (RL) and its practical applications in the business world. Throughout the course, students will explore RL's core principles and techniques and learn how to apply these methods effectively to solve various business challenges. From theoretical concepts to real-world scenarios, the course aims to equip students with the knowledge and skills necessary to leverage RL in business settings.

B.1. Learning outcomes: Upon completing this course, you will have developed the skills to:

  1. Construct autonomous agents capable of making effective decisions in various environments, both fully observable and partially observable.
  2. Design agents that can perform inference in uncertain environments and optimize actions based on diverse reward structures.
  3. Implement Reinforcement Learning (RL) techniques to address various business challenges, such as those found in finance, marketing, and other sectors.
  4. Establish a solid foundation for advanced studies in RL theory or applications, equipping you for future academic or professional pursuits in this field.

This course will have a significant programming component and class/homework exercises.

B.2. Prerequisites: Formal: (i) Introduction to Computer Programming and (ii) Calculus. We will also draw on basic concepts from the following courses, which we will quickly review in class: (i) Linear Algebra, (ii) Probability, (iii) Multivariate Calculus, and (iv) Algorithms.

B.2.1 Standing requirement: Sophomores and up

B.3. Textbook: This course does not require the purchase of a textbook. Given the novelty of many topics, current textbooks may not comprehensively address them. Instead, we will curate our materials from various online sources, including notes from other courses, publicly accessible assignments, books, and lecture notes. We will ensure you have access to these resources either through direct copies or links. Additionally, we will assign readings from selected textbooks, some of which are freely available for download online. These texts have been chosen for their relevance and depth of information, aligning closely with our course content.

  • Deep Reinforcement Learning, by Miguel Morales
  • Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (e-book available for free)
  • Artificial Intelligence: A Modern Approach by Stuart J. Russell and Peter Norvig, Prentice Hall, 2010, ISBN: 9780132071482

B.4. Lecture slides: The course staff will upload the slide deck for each lecture shortly before the session begins. Please be aware that these slides serve primarily as visual aids to facilitate class discussions and should not be considered a substitute for the assigned readings. They complement the material rather than summarize it. As such, a thorough grasp of the course content will require engagement with the complete assigned readings.

B.5. Mathematics: At its core, AI builds on algorithms founded on fundamental mathematical concepts. This course requires familiarity with basic mathematical principles, including linearity, conditional probability, Bayes' rule, multivariate derivatives, asymptotic notation, and statistical tests. In our first lecture, we will provide a math self-assessment to gauge your understanding and offer resources for mastering these essential concepts. While theorem proofs are not a course requirement, a solid grasp of the basic mathematical notation and programming concepts related to these topics is crucial for actively engaging in and comprehending the course discussions.
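
As a quick self-check (a refresher only, not new course material), you should be comfortable reading and manipulating a statement like Bayes' rule, written here for two generic events A and B:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}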

B.6. Software

  • The course requires access to the Python 3 programming language environment. We recommend installing Miniconda.
  • We recommend Overleaf (free for NYU affiliates) for typing math. This platform uses LaTeX, which is easy to use (see a good reference guide here).
C. Coursework

Please note that this is a 2-credit course that spans over seven weeks instead of the standard 14 weeks. Despite the condensed timeline, the weekly workload for this course is comparable to a regular 4-credit course. This course's rigorous nature necessitates allocating sufficient time and focused effort to ensure that you stay on track and meet the learning objectives within the designated timeframe.

C.1. Homework policy

  • Late assignments will have their grades divided by 2 for each late day. For example, submitting the day following the deadline halves your grade, two days late divides your grade by 4, and so on (see the short sketch after this list).
  • In unexpected circumstances where you must miss a deadline, we will drop one assignment with the lowest grade. No other exception will be allowed.
  • We will grade homework sets based on your ability to demonstrate the applicability of the concepts you learned to solve the problems. For open-ended questions, if you show knowledge and exercise sound judgment in arriving at a solution, we will assign full credit to that answer, even if the solution is not entirely correct.
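
To make the late-day arithmetic concrete, here is a minimal sketch in Python; the function and variable names are ours, purely for illustration:

    # Illustrative only: recorded grade for a submission scored `raw` that is `days_late` days late.
    def late_adjusted_grade(raw: float, days_late: int) -> float:
        # Each late day divides the grade by 2: 1 day -> raw/2, 2 days -> raw/4, and so on.
        return raw / (2 ** days_late)

    # Example: a 90 submitted two days late is recorded as 22.5.
    print(late_adjusted_grade(90, 2))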

C.2. Generative AI in coursework: Generative AI is a remarkable tool in modern education, offering personalized interactions with human knowledge. Recognizing its potential to enhance learning, we encourage students to use Generative AI as an aid for homework assignments and for preparing student lecture content; still, you must fully understand what you submit and write your work in your own words. Note two important exceptions to the use of Generative AI in this course:

  1. Final Exam: The Final Exam is a traditional, closed-book exam taken in the classroom. This format ensures a fair assessment of individual knowledge and understanding, independent of external AI assistance or other sources.
  2. Student Lectures: While you may use Generative AI in the preparation phase, you must personally deliver your student lecture. The use of AI avatars or voices for the presentation is not permitted. This policy aims to develop and assess your presentation skills and your ability to communicate complex ideas effectively.

In all uses of Generative AI, students must adhere to academic integrity principles, ensuring that all work submitted is their own and that any AI-generated content or ideas are appropriately credited. 

C.3. Academic integrity: We aim to foster a fair and supportive learning environment at NYU. Upholding the highest standards of academic integrity is paramount in this course. The university takes any breach of academic integrity seriously, and violations may result in severe consequences. Academic misconduct, such as plagiarism, failing to cite sources properly, or submitting work produced by others as your own, is a serious violation. We emphasize the importance of originality and proper attribution in all your academic endeavors, as these are fundamental to the principles of intellectual honesty and scholarly practice.

C.3.1. Honor Pledge: In some cases, we may require students to sign the honor pledge and submit it with assignments and exams (in those cases, the course staff will only grade submissions with an honor pledge signature). For convenience, we will share a pledge template here: "I affirm that I will not give or receive any unauthorized help on this academic activity and that all work I submit is my own."

C.4. Collaboration policy

  • You may discuss the problem sets with other students in the class. Still, you must understand what you submit fully and write your solutions separately in your own words.
  • We will check for plagiarism using powerful algorithms that are hard to fool (please don't try).
  • As a general academic principle, students must name any sources or collaborators with whom they discussed problems.

C.5. Learning Management System: Homework sets and assignments will be posted and announced on Brightspace.

C.6. Learning assessment

  • Exams: We will have a closed-book, in-class Final Exam, which you will have 75 minutes to complete; we will allow one two-sided cheat sheet (size 8.5x11"). The exam material is cumulative: the Final Exam covers everything discussed during the course.
  • Problem sets: We will offer weekly problem sets that include coding and a written component.
  • Student Lecture: See the "Student Lecture" section below.
  • Participation: An evaluation of how much you have contributed to the lectures.

C.6.1. Grading

  • Weekly Problem sets (20%)
  • Final Exam (35%)
  • Student Lecture (30%)
  • Participation (15%)

C.6.2. Makeup Exam Policy: This course does not provide a makeup exam. If you cannot attend the Final Exam and have an approved exception from the Academic Affairs office, you may apply to receive an 'Incomplete' grade. This option applies when a student has completed all but a single requirement or a small amount of coursework and, for good reason, is unable to submit the remaining work by the end of the semester. Academic Affairs will review the request to make sure it meets these criteria; note that they will not approve requests to make up a significant amount of coursework. This policy allows you to complete the course requirements during the subsequent course offering the following year. Keep this policy in mind, especially if you need the course grade for graduation.

C.6.3. Regrade requests: Ensuring fair grading is a cornerstone of our course. If you feel that your assignment or exam requires a re-evaluation, we ask that you visit us during office hours to discuss your concerns. A course staff member will document and present your points at our course staff meeting for a collective review. We commit to providing you with a response within 7-10 days following a comprehensive discussion of your request. Please be aware that we will meticulously reassess your entire submission when a regrade is requested; as a result, your final grade could either increase or decrease. If you wish to appeal the grade further due to a grade miscalculation or a misapplication of syllabus, area, or university policies, you should submit a formal written appeal to the Assistant Dean for Academic Affairs.

C.6.4. Issues about performance: If you need help learning the material, contact us as soon as possible to discuss what is not working and how we can assist. It is challenging to address questions about performance at the end of the semester or after final grades are submitted. So, if you are worried about your learning, seek help early.

C.6.5. Pass/Fail Option: If you select the binary grade option (when available), to receive a passing grade you must (i) achieve at least 63% overall in the course, (ii) earn at least 41% of the grade on every assignment, exam, and project, and (iii) attend at least 90% of the lectures.

D. In-class code of conduct

D.1. Neutrality Statement:  While we acknowledge individual identities and conditions, they are immaterial for interactions, learning and performance evaluation in the class. As such, they do not influence the course staff's judgments. We acknowledge our imperfections while fully committing to the work inside and outside our classrooms to build and sustain a campus community that increasingly embraces these core values. 

D.2. Electronic device policy

  • Students must turn off and put away cell/mobile/smartphones for the duration of the class (this is NYU Shanghai's policy)
  • Laptop screens can be distracting to students sitting behind you. If you open programs unrelated to the class, e.g., email, social media, games, videos, etc., please sit at the back of the classroom.

D.3. Instructor Goals: My primary aim is to impact student learning and growth positively. To achieve this, I strive to:

  • Cultivate critical and original thinking, laying a foundation for lifelong learning through engaging lectures, thought-provoking discussions, and relevant assignments.
  • Inspire students to lead lives of discovery, satisfaction, and impactful contributions.
  • Share my passion for the subject, igniting a similar enthusiasm among students.
  • Guide students towards mastery of the subject area, extending mentorship beyond the classroom.
  • Continuously develop new and stimulating course materials, assignments, and curricular initiatives.
  • Seamlessly integrate technology into the course to augment learning experiences.

The dates below are tentative. Depending on how the lectures develop, some topics may be covered earlier or later. We will populate the list as we progress.

Week 1
  3/25  Introduction to RL
        Reading: Course: deeplearning.ai, Mathematics for AI; ACM AI in China; Mathematics for Machine Learning by Garrett Thomas (Berkeley); coding exercise materials
        Assignments: Homework 0 (written)
  3/27  Introduction to the course, problem-solving agents, state spaces, actions, rewards, search trees, uninformed search algorithms
        Reading (required): Russell & Norvig, 3.1-3.4

Week 2
  4/1   No class: Spring Recess (April 1 - April 5) and Qingming Holiday
        Reading: Sutton and Barto, Ch. 3
  4/3   No class: Spring Recess (April 1 - April 5) and Qingming Holiday

Week 3
  4/8   Uniform cost search; intro to Markov Decision Processes
        Reading: Sutton and Barto, Ch. 3
        Assignments: Homework 1 (coding), Homework 1 (written)
  4/10  Markov Decision Processes: value states, q-states, policies, action optimization, time horizons and discounted rewards

Week 4
  4/13  Markov Decision Processes: value states, q-states, policies, action optimization, time horizons and discounted rewards
        Reading: Sutton and Barto, Ch. 4
  4/15  Bellman equations
        Reading: Sutton and Barto, Ch. 4
  4/17  Value iteration, fixpoint solutions, the convergence of value iteration
        Reading: Sutton and Barto, Ch. 4
        Assignments: Homework 2 (written), Homework 2 (coding)

Week 5
  4/22  Policy evaluation, policy extraction, policy iteration, the multi-armed bandit problem
  4/24  Multi-armed bandit problems; reinforcement learning: offline planning vs. online learning, model-based vs. model-free learning, passive vs. active learning, direct evaluation, Q-learning, linear approximate Q-learning
        Reading: Sutton and Barto, Ch. 2; Sutton and Barto, Ch. 6.1, 6.2, 6.5
        Assignments: Homework 3 (written), Homework 3 (coding), Student Lecture

Week 6
  4/29  RL application in finance
        Reading: Nevmyvaka, Feng, and Kearns. 2006. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06)
  5/1   China Labor Day Holiday (no classes scheduled)

Week 7
  5/6   RL application in finance
  5/8   RL application in transportation

Week 8
  5/13  Deep Reinforcement Learning
        Reading: Deep Reinforcement Learning: Pong from Pixels
  5/15  Final Exam
  5/15  Student Lecture deadline

Student Lecture

Deadline: See Calendar or Brightspace.

Why: This project aims to expand your familiarity with business applications of Reinforcement Learning (RL). The project will also exercise your ability to communicate what you have learned clearly to a broad audience, a critical skill for job interviews or technical meetings.

What: Your task is to study a business application of Reinforcement Learning, ideally, one that you are particularly interested in, e.g., because it is related to your major, or you have worked on a project in which RL is applicable, or there is a dataset/problem you care about, etc. There are hundreds of RL applications in many different domains. This video talks about applying RL in the real world.

How: Your task is to produce a short lecture that teaches the application you have studied. The format of your presentation is a 5 to 7-minute video. You may optionally include a set of lecture notes, slides, etc. The critical goal is to keep the lecture self-contained and informative for viewers interested in that application, including those unfamiliar with your major. Your work is not meant to be a summary or a recitation of the paper you read. Instead, the goal is to convey your own views of the method, especially aspects that are not readily apparent on a quick first reading of the paper. You should help your viewers read between the lines and think critically about that piece of work, synthesizing the underlying ideas and your perspectives. You may assume that the viewers of your lecture have taken this course. Our lecture "RL for Finance" may be used as a reference on how to present an application based on a paper or technical report. In your lecture, you will address the following points:

    • A comprehensive description of the domain and the problem you are studying (i.e., a student who is not familiar with your major should be able to understand your explanation. For example, if you're a finance major studying an application to "option valuation", you need to assume that some students who are not in finance may not know what "options" are.)
    • Insights: What are the properties of the problem that make it a good candidate for being solved with Reinforcement Learning? Why do you expect that RL would work? Why is RL better than other methods in this context?
    • Modeling: Describe the components of the RL formulation, i.e., state space, action space, transition function, etc., as we did in class (a minimal schematic of these components appears after this list). What are the modeling assumptions? Are they realistic?
    • Algorithm: What algorithms are used to solve the problem, e.g., value iteration, Q-learning, approximate Q-values, Deep RL, etc.? Is the approach model-based or model-free? Is the training active or passive? What are the exploration strategies?
    • Results and potential directions: What are the results? Are they significant and impactful? Do they improve on traditional approaches or baselines (and why)? What are the weaknesses of the methods? What are some promising further questions, and how could they be pursued?
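
To make the "Modeling" bullet concrete, here is a minimal, illustrative schematic of the components your lecture should identify. The names (states, actions, transition, reward, gamma) and the toy inventory example are ours, purely for illustration; they are not a required format.

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    State = str
    Action = str

    @dataclass
    class MDPFormulation:
        states: List[State]                        # state space
        actions: Dict[State, List[Action]]         # action space, per state
        # transition(s, a) -> list of (next_state, probability) pairs
        transition: Callable[[State, Action], List[Tuple[State, float]]]
        # reward(s, a, s_next) -> immediate reward
        reward: Callable[[State, Action, State], float]
        gamma: float                               # discount factor

    # Toy example (hypothetical): a two-state inventory problem.
    toy = MDPFormulation(
        states=["low", "high"],
        actions={"low": ["reorder", "wait"], "high": ["sell", "wait"]},
        transition=lambda s, a: [("high", 0.8), ("low", 0.2)] if a == "reorder" else [(s, 1.0)],
        reward=lambda s, a, s_next: 1.0 if a == "sell" else -0.1,
        gamma=0.9,
    )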

The following short document may serve as a reference for the critical reading of a scientific paper or technical report: https://www.eecs.harvard.edu/~michaelm/postscripts/ReadPaper.pdf

Submission and privacy: Submit your work on Brightspace in mp4 format. Be careful with the size of your file, as Brightspace has trouble uploading very large ones. You may request that we keep your work confidential; otherwise, we will add it to the video library of our course website after a thorough review by the instructor.

Video examples See our Video Gallery section

Below, I suggest some applications based on students' majors. Feel free, however, to find an application that is not listed here. Harvard Business Review (April 21, 2021): Applications of Reinforcement Learning in Business

Business and Finance
Computer/Data Science

This page lists many more applications in several domains, such as education, health care, business management, etc. The course staff is here to help. Feel free to contact the instructor or the TA to discuss your lecture.

Post Your Video

We encourage you to utilize our office hours for any queries or discussions. This approach is the most effective way to communicate with course staff, especially given the challenge of promptly responding to the high volume of emails. Office hours allow us to address your concerns quickly and comprehensively.

Please use the form provided below if you need to book a 15-minute office hour slot, schedule an appointment for confidential discussions about course content or logistics, or if you have a brief question, which we can quickly resolve via email. For any special requests related to exceptions to our syllabus rules or deadlines, refer to the 'Exceptional circumstances' section of the syllabus. This section offers clear guidelines on how to proceed with such requests.






    For the final project, you are going to produce a 5-minute video where you present how you applied the techniques introduced in this course to some data set of your choosing. The goal of the project is to prepare you to apply state-of-the-art methods to an application. The video you will produce is going to be posted on the course website, and you may use it as a demonstration of your work for future employers as part of your portfolio.
    Your first task is to pick a project topic, find a relevant dataset, and write a project proposal. You are to work in teams of two. You may apply any techniques to just about any domain. The best class projects come from students working on topics that they're excited about. So, pick something that you can get excited and passionate about! Be brave rather than timid, and do feel free to propose ambitious things that you're excited about (but be mindful of the limited time frame to complete the project).
    If you're already working on a research project that AI might be applicable to, then working out how to apply AI techniques to it will often make a very good project topic. Similarly, if you currently work in industry and have an application on which AI might help, that could also make a great project.
    Summary
    Teams of two;  Be creative – think of new problems that you can tackle using the techniques you have learned.
    • Scope: ~40 hours/person
    • There are four deliverables:
      • Project proposal (Reaction paper): due TBD.
      • Short proposal presentation in class; course staff feedback TBD
      • Project checkpoint - Milestone due TBD
      • Final video: due TBD
    More details of the project deliverables below.
    Project proposal

    The proposal consists of a 1-3 page document structured around answers to the following questions:

    • What is the problem you are solving? Has it been addressed before?
    • What data will you use (how will you get it)?
    • What work do you plan to do to complete the project?
    • Which algorithms/techniques/models do you plan to use/develop? Be as specific as you can.
    • How will you evaluate your method? How will you test it? How will you measure success?
    • What do you expect to submit/accomplish by the end of the course?
    • Include a list of references, e.g., research papers, websites, etc.

    The project should be original, i.e., not previously done by anyone, including you, and you should observe the academic integrity policy of this course. Submit your proposal in PDF format on NYU Classes.

    Proposal presentation: Each group will prepare a short presentation (3-5 minutes) of their proposal. The course staff will provide feedback on your presentation, either during the presentation or offline afterwards.

    Project Milestone: The project milestone is a 2-3 page document meant for the course staff to check in with you regarding your progress. Think of it as a draft of your final project, but without your major results. After this step, you will have two weeks to complete the project.

    • We expect that you have completed 40% of the project
    • Provide a complete picture of your project even if certain key parts have not yet been implemented/solved.
    • Include the parts of your project which have been completed so far, such as:
      • Revised introduction of your problem
      • Review of the relevant prior work
      • Description of the data collection process
      • Description or plots of any initial findings or summary statistics from your dataset
      • Description of any background necessary for your problem
      • Formal description of any important methods used
      • Description of general difficulties with your problem which bear elaboration
    • Make sure to outline the parts which have not yet been completed so that it is clear what you plan to do for the final version.
    Submit your milestone in PDF format on NYU Classes.

    Final project format: A video presentation, at most 5 minutes, submitted in mp4 format. Optionally, you may also submit an accompanying document, open-source code, and an open-source dataset (make sure you have permission to share) for replication purposes.

    Evaluation

    Projects will be evaluated based on:

    • The technical quality of the work: Does the technical material make sense? Are the methods tried reasonable? Are the proposed applications clever and interesting? Do the authors convey novel insights about the problem?
    • Significance: Did the authors choose an interesting or a "real" problem to work on, or only a small "toy" problem that can be solved with a trivial application of a Python library function? Is this work likely to be useful and/or have impact?
    • The novelty of the work, completeness of the solution, and the clarity of the presentation.

    In order for the course staff to assign individual scores, we ask you to include, at the end of each report, a brief summary of the individual contributions of each team member, in the format outlined below.

    Example:
    Wei: Plotted graphs during data analysis, collected the data, preliminary data analysis, problem formulation, report writing
    Chris: Proposed the algorithm, coding up the algorithm, running tests, tabulating final results

    Video examples: See our Video Gallery section.

    Data sources: You are free to find any dataset you're interested in (as long as it allows you to apply a technique you learned in the course). Here are a few data sources you can browse to search for a dataset:

     Pacman Search


    All those colored walls, Mazes give Pacman the blues, So teach him to search.

    Introduction

    In this project, your Pacman agent will find paths through his maze world, both to reach a particular location and to collect food efficiently. You will build general search algorithms and apply them to Pacman scenarios. This project includes an autograder for you to grade your answers on your machine. This can be run with the command:

    python autograder.py

    The code for this project consists of several Python files, some of which you will need to read and understand in order to complete the assignment, and some of which you can ignore. You can download all the code and supporting files as a zip archive.

    Files you'll edit:
    search.py: Where all of your search algorithms will reside.
    searchAgents.py: Where all of your search-based agents will reside.

    Files you might want to look at:
    pacman.py: The main file that runs Pacman games. This file describes a Pacman GameState type, which you use in this project.
    game.py: The logic behind how the Pacman world works. This file describes several supporting types like AgentState, Agent, Direction, and Grid.
    util.py: Useful data structures for implementing search algorithms.

    Supporting files you can ignore:
    graphicsDisplay.py: Graphics for Pacman
    graphicsUtils.py: Support for Pacman graphics
    textDisplay.py: ASCII graphics for Pacman
    ghostAgents.py: Agents to control ghosts
    keyboardAgents.py: Keyboard interfaces to control Pacman
    layout.py: Code for reading layout files and storing their contents
    autograder.py: Project autograder
    testParser.py: Parses autograder test and solution files
    testClasses.py: General autograding test classes
    test_cases/: Directory containing the test cases for each question
    searchTestClasses.py: Project 1 specific autograding test classes

    Files to Edit and Submit: You will fill in portions of search.py and searchAgents.py during the assignment. You should submit these files with your code and comments. Please do not change the other files in this distribution or submit any of our original files other than these two.

    Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's judgements -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

    Academic Dishonesty: We will be checking your code against other submissions in the class and against code posted online for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

    Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours and review sessions are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.


    Welcome to Pacman

    After downloading the code (zip archive), unzipping it, and changing to the directory, you should be able to play a game of Pacman by typing the following at the command line:

    python pacman.py

    Pacman lives in a shiny blue world of twisting corridors and tasty round treats. Navigating this world efficiently will be Pacman's first step in mastering his domain. The simplest agent in searchAgents.py is called the GoWestAgent, which always goes West (a trivial reflex agent). This agent can occasionally win:

    python pacman.py --layout testMaze --pacman GoWestAgent

    But, things get ugly for this agent when turning is required:

    python pacman.py --layout tinyMaze --pacman GoWestAgent

    If Pacman gets stuck, you can exit the game by typing CTRL-c into your terminal. Soon, your agent will solve not only tinyMaze, but any maze you want. Note that pacman.py supports a number of options that can each be expressed in a long way (e.g., --layout) or a short way (e.g., -l). You can see the list of all options and their default values via:

    python pacman.py -h

    Also, all of the commands that appear in this project also appear in commands.txt, for easy copying and pasting. In UNIX/Mac OS X, you can even run all these commands in order with bash commands.txt.


    Question 1: Finding a Fixed Food Dot using Depth First Search

    In searchAgents.py, you'll find a fully implemented SearchAgent, which plans out a path through Pacman's world and then executes that path step-by-step. The search algorithms for formulating a plan are not implemented -- that's your job. As you work through the following questions, you might find it useful to refer to the object glossary in the appendix of this page. First, test that the SearchAgent is working correctly by running:

    python pacman.py -l tinyMaze -p SearchAgent -a fn=tinyMazeSearch

    The command above tells the SearchAgent to use tinyMazeSearch as its search algorithm, which is implemented in search.py. Pacman should navigate the maze successfully. Now it's time to write full-fledged generic search functions to help Pacman plan routes! Remember that a search node must contain not only a state but also the information necessary to reconstruct the path (plan) which gets to that state.

    Important note: All of your search functions need to return a list of actions that will lead the agent from the start to the goal. These actions all have to be legal moves (valid directions, no moving through walls).

    Important note: Make sure to use the Stack, Queue and PriorityQueue data structures provided to you in util.py! These data structure implementations have particular properties which are required for compatibility with the autograder.

    Hint: Each algorithm is very similar. Algorithms for DFS and BFS differ only in the details of how the fringe is managed. So, concentrate on getting DFS right and the rest should be relatively straightforward. Indeed, one possible implementation requires only a single generic search method which is configured with an algorithm-specific queuing strategy. (Your implementation need not be of this form to receive full credit.)

    Implement the depth-first search (DFS) algorithm in the depthFirstSearch function in search.py. To make your algorithm complete, write the graph search version of DFS, which avoids expanding any already visited states. Your code should quickly find a solution for:

    python pacman.py -l tinyMaze -p SearchAgent
    python pacman.py -l mediumMaze -p SearchAgent
    python pacman.py -l bigMaze -z .5 -p SearchAgent

    The Pacman board will show an overlay of the states explored, and the order in which they were explored (brighter red means earlier exploration). Is the exploration order what you would have expected? Does Pacman actually go to all the explored squares on his way to the goal? Hint: If you use a Stack as your data structure, the solution found by your DFS algorithm for mediumMaze should have a length of 130 (provided you push successors onto the fringe in the order provided by getSuccessors; you might get 246 if you push them in the reverse order). Is this a least cost solution? If not, think about what depth-first search is doing wrong.
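
    For orientation only, here is one possible shape of the generic graph search the hint alludes to. This is a sketch, not the reference solution: it assumes the SearchProblem interface used in this project (getStartState(), isGoalState(state), and getSuccessors(state) returning (successor, action, stepCost) triples) and a util.Stack fringe; swapping in util.Queue would give breadth-first search.

    import util  # provides Stack, Queue, PriorityQueue

    def genericGraphSearch(problem, fringe):
        """Sketch: return a list of actions from the start state to a goal, or [] if none is found."""
        fringe.push((problem.getStartState(), []))      # each fringe entry: (state, actions so far)
        visited = set()
        while not fringe.isEmpty():
            state, actions = fringe.pop()
            if problem.isGoalState(state):
                return actions
            if state in visited:
                continue
            visited.add(state)
            for successor, action, stepCost in problem.getSuccessors(state):
                fringe.push((successor, actions + [action]))
        return []

    def depthFirstSearch(problem):
        # DFS: last-in, first-out fringe.
        return genericGraphSearch(problem, util.Stack())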


    Question 2: Breadth First Search

    Implement the breadth-first search (BFS) algorithm in the breadthFirstSearch function in search.py. Again, write a graph search algorithm that avoids expanding any already visited states. Test your code the same way you did for depth-first search.

    python pacman.py -l mediumMaze -p SearchAgent -a fn=bfs
    python pacman.py -l bigMaze -p SearchAgent -a fn=bfs -z .5

    Does BFS find a least cost solution? If not, check your implementation. Hint: If Pacman moves too slowly for you, try the option --frameTime 0. Note: If you've written your search code generically, your code should work equally well for the eight-puzzle search problem without any changes.

    python eightpuzzle.py
    

    Question 3: Varying the Cost Function

    While BFS will find a fewest-actions path to the goal, we might want to find paths that are "best" in other senses. Consider mediumDottedMaze and mediumScaryMaze. By changing the cost function, we can encourage Pacman to find different paths. For example, we can charge more for dangerous steps in ghost-ridden areas or less for steps in food-rich areas, and a rational Pacman agent should adjust its behavior in response. Implement the uniform-cost graph search algorithm in the uniformCostSearch function in search.py. We encourage you to look through util.py for some data structures that may be useful in your implementation. You should now observe successful behavior in all three of the following layouts, where the agents below are all UCS agents that differ only in the cost function they use (the agents and cost functions are written for you):

    python pacman.py -l mediumMaze -p SearchAgent -a fn=ucs
    python pacman.py -l mediumDottedMaze -p StayEastSearchAgent
    python pacman.py -l mediumScaryMaze -p StayWestSearchAgent

    Note: You should get very low and very high path costs for the StayEastSearchAgent and StayWestSearchAgent respectively, due to their exponential cost functions (see searchAgents.py for details).
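
    In the same spirit, here is a hedged sketch of uniform-cost search (not the reference solution): the only change from the generic skeleton shown earlier is a fringe ordered by cumulative path cost, using the push(item, priority) interface of util.PriorityQueue.

    import util

    def uniformCostSearch(problem):
        """Sketch: expand nodes in order of cumulative path cost."""
        fringe = util.PriorityQueue()
        fringe.push((problem.getStartState(), [], 0), 0)    # (state, actions, cost), priority = cost
        visited = set()
        while not fringe.isEmpty():
            state, actions, cost = fringe.pop()
            if problem.isGoalState(state):
                return actions
            if state in visited:
                continue
            visited.add(state)
            for successor, action, stepCost in problem.getSuccessors(state):
                fringe.push((successor, actions + [action], cost + stepCost), cost + stepCost)
        return []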


    Submission

    In order to submit your project, please upload the following files to Homework 1 (coding) on Brightspace: search.py and searchAgents.py. Please do not upload the files in a zip file or a directory as the autograder will not work if you do so.

    MDPs


    Pacman wants reward. Should he eat or should he run?

    Introduction

    In this project, you will implement the value iteration algorithm for MDPs. You will test your agents first on Gridworld (from class). As in previous projects, this project includes an autograder for you to grade your solutions on your machine. This can be run on all questions with the command:

    python autograder.py

    It can be run for one particular question, such as q2, by:

    python autograder.py -q q2

    It can be run for one particular test by commands of the form:

    python autograder.py -t test_cases/q2/1-bridge-grid

    The code for this project contains the following files, available as a zip archive.

    Files you'll edit:
    valueIterationAgents.py: A value iteration agent for solving known MDPs.
    qlearningAgents.py: Q-learning agents for Gridworld, Crawler and Pacman.
    analysis.py: A file to put your answers to questions given in the project.

    Files you should read but NOT edit:
    mdp.py: Defines methods on general MDPs.
    learningAgents.py: Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend.
    util.py: Utilities, including util.Counter, which is particularly useful for Q-learners.
    gridworld.py: The Gridworld implementation.
    featureExtractors.py: Classes for extracting features on (state, action) pairs. Used for the approximate Q-learning agent (in qlearningAgents.py).

    Files you can ignore:
    environment.py: Abstract class for general reinforcement learning environments. Used by gridworld.py.
    graphicsGridworldDisplay.py: Gridworld graphical display.
    graphicsUtils.py: Graphics utilities.
    textGridworldDisplay.py: Plug-in for the Gridworld text interface.
    crawler.py: The crawler code and test harness. You will run this but not edit it.
    graphicsCrawlerDisplay.py: GUI for the crawler robot.
    autograder.py: Project autograder
    testParser.py: Parses autograder test and solution files
    testClasses.py: General autograding test classes
    test_cases/: Directory containing the test cases for each question
    reinforcementTestClasses.py: Project 3 specific autograding test classes

    Files to Edit and Submit: You will fill in portions of valueIterationAgents.py, qlearningAgents.py, and analysis.py during the assignment. Please do not change the other files in this distribution or submit any of our original files other than these files. Note: You only need to submit reinforcement.token, generated by running submission_autograder.py. It contains the evaluation results from your local autograder and a copy of all your code. You do not need to submit any other files.

    Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's judgements -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

    Academic Dishonesty: We will be checking your code against other submissions in the class and against code posted online for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

    Getting Help: You are not alone! If you find yourself stuck on something, group with other students or contact the course staff for help. Office hours are available for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.


    MDPs

    To get started, run Gridworld in manual control mode, which uses the arrow keys:

    python gridworld.py -m

    You will see the two-exit layout from class. The blue dot is the agent. Note that when you press up, the agent only actually moves north 80% of the time. Such is the life of a Gridworld agent! You can control many aspects of the simulation. A full list of options is available by running:

    python gridworld.py -h

    The default agent moves randomly:

    python gridworld.py -g MazeGrid

    You should see the random agent bounce around the grid until it happens upon an exit. Not the finest hour for an AI agent. Note: The Gridworld MDP is such that you first must enter a pre-terminal state (the double boxes shown in the GUI) and then take the special 'exit' action before the episode actually ends (in the true terminal state called TERMINAL_STATE, which is not shown in the GUI). If you run an episode manually, your total return may be less than you expected, due to the discount rate (-d to change; 0.9 by default). Look at the console output that accompanies the graphical output (or use -t for all text). You will be told about each transition the agent experiences (to turn this off, use -q). As in Pacman, positions are represented by (x,y) Cartesian coordinates and any arrays are indexed by [x][y], with 'north' being the direction of increasing y, etc. By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r).


    Question 1: Value Iteration

    Recall the value iteration state update equation:

    V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]

    Write a value iteration agent in ValueIterationAgent, which has been partially specified for you in valueIterationAgents.py. Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i) in its initial planning phase. ValueIterationAgent takes an MDP on construction and runs value iteration for the specified number of iterations before the constructor returns. Value iteration computes k-step estimates of the optimal values, V_k. In addition to running value iteration, implement the following methods for ValueIterationAgent using V_k.

    • computeActionFromValues(state) computes the best action according to the value function given by self.values.
    • computeQValueFromValues(state, action) returns the Q-value of the (state, action) pair given by the value function given by self.values.

    These quantities are all displayed in the GUI: values are numbers in squares, Q-values are numbers in square quarters, and policies are arrows out from each square.

    Important: Use the "batch" version of value iteration where each vector V_k is computed from a fixed vector V_{k-1} (like in lecture), not the "online" version where one single weight vector is updated in place. This means that when a state's value is updated in iteration k based on the values of its successor states, the successor state values used in the value update computation should be those from iteration k-1 (even if some of the successor states had already been updated in iteration k). The difference is discussed in Sutton & Barto in the 6th paragraph of chapter 4.1.

    Note: A policy synthesized from values of depth k (which reflect the next k rewards) will actually reflect the next k+1 rewards (i.e., you return π_{k+1}). Similarly, the Q-values will also reflect one more reward than the values (i.e., you return Q_{k+1}).

    You should return the synthesized policy π_{k+1}.
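
    As a non-authoritative illustration of the batch update only (not the graded ValueIterationAgent, and ignoring the agent/constructor plumbing), here is a sketch that assumes the MDP interface from mdp.py: getStates(), getPossibleActions(s), getTransitionStatesAndProbs(s, a), getReward(s, a, s2), and isTerminal(s).

    import util

    def batch_value_iteration(mdp, discount, iterations):
        """Sketch of batch value iteration: V_{k+1} is computed from a frozen copy of V_k."""
        values = util.Counter()                      # V_0 = 0 for every state
        for _ in range(iterations):
            new_values = util.Counter()              # V_{k+1}, filled from the fixed V_k below
            for state in mdp.getStates():
                if mdp.isTerminal(state):
                    continue
                q_values = []
                for action in mdp.getPossibleActions(state):
                    q = sum(prob * (mdp.getReward(state, action, nxt) + discount * values[nxt])
                            for nxt, prob in mdp.getTransitionStatesAndProbs(state, action))
                    q_values.append(q)
                if q_values:
                    new_values[state] = max(q_values)
            values = new_values                      # swap in V_{k+1} all at once
        return values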

    Hint: You may optionally use the util.Counter class in util.py, which is a dictionary with a default value of zero. However, be careful with argMax: the actual argmax you want may be a key not in the counter! Note: Make sure to handle the case when a state has no available actions in an MDP (think about what this means for future rewards). To test your implementation, run the autograder:

    python autograder.py -q q1

    The following command loads your ValueIterationAgent, which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (V(start), which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

    python gridworld.py -a value -i 100 -k 10

    Hint: On the default BookGrid, running value iteration for 5 iterations should give you this output:

    python gridworld.py -a value -i 5

    [Figure: value iteration with k=5]

    Grading: Your value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g., after 100 iterations).


    Question 2: Bridge Crossing Analysis

    BridgeGrid is a grid world map with a low-reward terminal state and a high-reward terminal state separated by a narrow "bridge", on either side of which is a chasm of high negative reward. The agent starts near the low-reward state. With the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge. Change only ONE of the discount and noise parameters so that the optimal policy causes the agent to attempt to cross the bridge. Put your answer in question2() of analysis.py. (Noise refers to how often an agent ends up in an unintended successor state when they perform an action.) The default corresponds to:

    python gridworld.py -a value -i 100 -g BridgeGrid --discount 0.9 --noise 0.2

    [Figure: value iteration with k=100]

    Grading: We will check that you only changed one of the given parameters, and that with this change, a correct value iteration agent should cross the bridge. To check your answer, run the autograder:

    python autograder.py -q q2

    Question 3: Policies

    Consider the DiscountGrid layout, shown below. This grid has two terminal states with positive payoff (in the middle row): a close exit with payoff +1 and a distant exit with payoff +10. The bottom row of the grid consists of terminal states with negative payoff (shown in red); each state in this "cliff" region has payoff -10. The starting state is the yellow square. We distinguish between two types of paths: (1) paths that "risk the cliff" and travel near the bottom row of the grid; these paths are shorter but risk earning a large negative payoff, and are represented by the red arrow in the figure below. (2) paths that "avoid the cliff" and travel along the top edge of the grid. These paths are longer but are less likely to incur huge negative payoffs. These paths are represented by the green arrow in the figure below.

    [Figure: DiscountGrid]

    In this question, you will choose settings of the discount, noise, and living reward parameters for this MDP to produce optimal policies of several different types. Your setting of the parameter values for each part should have the property that, if your agent followed its optimal policy without being subject to any noise, it would exhibit the given behavior. If a particular behavior is not achieved for any setting of the parameters, assert that the policy is impossible by returning the string 'NOT POSSIBLE'. Here are the optimal policy types you should attempt to produce:

    1. Prefer the close exit (+1), risking the cliff (-10)
    2. Prefer the close exit (+1), but avoiding the cliff (-10)
    3. Prefer the distant exit (+10), risking the cliff (-10)
    4. Prefer the distant exit (+10), avoiding the cliff (-10)
    5. Avoid both exits and the cliff (so an episode should never terminate)

    Note: You can run q3 with the command below, using the discount rate (-d), noise (-n), and living reward (-r) options. If the window is too big for your computer, try changing the window size option (-w) to a smaller value, like 150 or 120.

    python gridworld.py -a value -g DiscountGrid

    To check your answers, run the autograder:

    python autograder.py -q q3

    question3a() through question3e() should each return a 3-item tuple of (discount, noise, living reward) in analysis.py. Note: You can check your policies in the GUI. For example, using a correct answer to 3(a), the arrow in (0,1) should point east, the arrow in (1,1) should also point east, and the arrow in (2,1) should point north. Note: On some machines you may not see an arrow. In this case, press a button on the keyboard to switch to qValue display, and mentally calculate the policy by taking the arg max of the available qValues for each state. Grading: We will check that the desired policy is returned in each case.
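
    For the expected shape only (the numbers below are placeholders, not an answer), each of these functions simply returns a plain 3-tuple:

    # analysis.py sketch: placeholder values only, not a solution.
    def question3a():
        answerDiscount = 0.9      # placeholder
        answerNoise = 0.2         # placeholder
        answerLivingReward = 0.0  # placeholder
        # If you believe no parameter setting produces the requested behavior,
        # return 'NOT POSSIBLE' instead of the tuple.
        return (answerDiscount, answerNoise, answerLivingReward)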


    Question 4: Asynchronous Value Iteration

    Write a value iteration agent in AsynchronousValueIterationAgent, which has been partially specified for you in valueIterationAgents.py. Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (option -i) in its initial planning phase. AsynchronousValueIterationAgent takes an MDP on construction and runs cyclic value iteration (described in the next paragraph) for the specified number of iterations before the constructor returns. Note that all this value iteration code should be placed inside the constructor (__init__ method). The reason this class is called AsynchronousValueIterationAgent is because we will update only one state in each iteration, as opposed to doing a batch-style update.

    Here is how cyclic value iteration works. In the first iteration, only update the value of the first state in the states list. In the second iteration, only update the value of the second. Keep going until you have updated the value of each state once, then start back at the first state for the subsequent iteration. If the state picked for updating is terminal, nothing happens in that iteration. You can implement it as indexing into the states variable defined in the code skeleton.

    As a reminder, here's the value iteration state update equation:

    V_{k+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]

    Value iteration iterates a fixed-point equation, as discussed in class. It is also possible to update the state values in different ways, such as in a random order (i.e., select a state randomly, update its value, and repeat) or in a batch style (as in Q1). In Q4, we will explore another technique. AsynchronousValueIterationAgent inherits from ValueIterationAgent from Q1, so the only method you need to implement is runValueIteration. Since the superclass constructor calls runValueIteration, overriding it is sufficient to change the agent's behavior as desired.

    Note: Make sure to handle the case when a state has no available actions in an MDP (think about what this means for future rewards).

    To test your implementation, run the autograder. It should take less than a second to run. If it takes much longer, you may run into issues later in the project, so make your implementation more efficient now.

    python autograder.py -q q4

    The following command loads your AsynchronousValueIterationAgent in the Gridworld, which will compute a policy and execute it 10 times. Press a key to cycle through values, Q-values, and the simulation. You should find that the value of the start state (V(start), which you can read off of the GUI) and the empirical resulting average reward (printed after the 10 rounds of execution finish) are quite close.

    python gridworld.py -a asynchvalue -i 1000 -k 10

    Grading: Your value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g., after 1000 iterations).
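
    For reference, here is a minimal sketch of the cyclic update described above. It assumes the standard skeleton's attributes (self.mdp, self.values, self.iterations) and the Q-value helper you wrote for Q1 (called computeQValueFromValues here); adapt the names if your stub differs. It is a sketch, not a drop-in solution.

    def runValueIteration(self):
        # Cyclic (asynchronous) value iteration: one state is updated per iteration.
        states = self.mdp.getStates()
        for i in range(self.iterations):
            state = states[i % len(states)]        # cycle through the states list
            if self.mdp.isTerminal(state):
                continue                           # terminal state: nothing happens
            actions = self.mdp.getPossibleActions(state)
            if not actions:
                continue                           # no actions: no future rewards to back up
            self.values[state] = max(
                self.computeQValueFromValues(state, a) for a in actions)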


    Question 5: Prioritized Sweeping Value Iteration

    You will now implement PrioritizedSweepingValueIterationAgent, which has been partially specified for you in valueIterationAgents.py. Note that this class derives from AsynchronousValueIterationAgent, so the only method that needs to change is runValueIteration, which actually runs the value iteration. Prioritized sweeping attempts to focus updates of state values in ways that are likely to change the policy. For this project, you will implement a simplified version of the standard prioritized sweeping algorithm, which is described in this paper. We've adapted this algorithm for our setting. First, we define the predecessors of a state s as all states that have a nonzero probability of reaching s by taking some action a. Also, theta, which is passed in as a parameter, will represent our tolerance for error when deciding whether to update the value of a state. Here's the algorithm you should follow in your implementation.

    • Compute predecessors of all states.
    • Initialize an empty priority queue.
    • For each non-terminal state s, do: (note: to make the autograder work for this question, you must iterate over states in the order returned by self.mdp.getStates())
      • Find the absolute value of the difference between the current value of s in self.values and the highest Q-value across all possible actions from s (this represents what the value should be); call this number diff. Do NOT update self.values[s] in this step.
      • Push s into the priority queue with priority -diff (note that this is negative). We use a negative because the priority queue is a min heap, but we want to prioritize updating states that have a higher error.
    • For iteration in 0, 1, 2, ..., self.iterations - 1, do:
      • If the priority queue is empty, then terminate.
      • Pop a state s off the priority queue.
      • Update s's value (if it is not a terminal state) in self.values.
      • For each predecessor p of s, do:
        • Find the absolute value of the difference between the current value of p in self.values and the highest Q-value across all possible actions from p (this represents what the value should be); call this number diff. Do NOT update self.values[p] in this step.
        • If diff > theta, push p into the priority queue with priority -diff (note that this is negative), as long as it does not already exist in the priority queue with equal or lower priority. As before, we use a negative because the priority queue is a min heap, but we want to prioritize updating states that have a higher error.

    A couple of important notes on implementation:

    • When you compute predecessors of a state, make sure to store them in a set, not a list, to avoid duplicates.
    • Please use util.PriorityQueue in your implementation. The update method in this class will likely be useful; look at its documentation.
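
    As a rough illustration of the algorithm above, here is a hedged sketch of runValueIteration. It assumes the skeleton stores the tolerance as self.theta, reuses the Q-value helper from Q1 (computeQValueFromValues), and that every non-terminal state has at least one action (as in Gridworld); adapt the names to your actual stub.

    def runValueIteration(self):
        # Compute predecessors of every state, stored in sets to avoid duplicates.
        predecessors = {s: set() for s in self.mdp.getStates()}
        for s in self.mdp.getStates():
            for a in self.mdp.getPossibleActions(s):
                for nextState, prob in self.mdp.getTransitionStatesAndProbs(s, a):
                    if prob > 0:
                        predecessors[nextState].add(s)

        # Seed the priority queue with each non-terminal state's current error.
        pq = util.PriorityQueue()
        for s in self.mdp.getStates():
            if self.mdp.isTerminal(s):
                continue
            bestQ = max(self.computeQValueFromValues(s, a)
                        for a in self.mdp.getPossibleActions(s))
            pq.push(s, -abs(self.values[s] - bestQ))   # negative: larger error popped first

        # Perform self.iterations updates, re-queuing predecessors whose error exceeds theta.
        for _ in range(self.iterations):
            if pq.isEmpty():
                return
            s = pq.pop()
            if not self.mdp.isTerminal(s):
                self.values[s] = max(self.computeQValueFromValues(s, a)
                                     for a in self.mdp.getPossibleActions(s))
            for p in predecessors[s]:
                if self.mdp.isTerminal(p):
                    continue
                bestQ = max(self.computeQValueFromValues(p, a)
                            for a in self.mdp.getPossibleActions(p))
                diff = abs(self.values[p] - bestQ)
                if diff > self.theta:
                    pq.update(p, -diff)                # only inserts/raises priority as needed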

    To test your implementation, run the autograder. It should take about 1 second to run. If it takes much longer, you may run into issues later in the project, so make your implementation more efficient now.

    python autograder.py -q q5

    You can run the PrioritizedSweepingValueIterationAgent in the Gridworld using the following command.

    python gridworld.py -a priosweepvalue -i 1000

    Grading: Your prioritized sweeping value iteration agent will be graded on a new grid. We will check your values, Q-values, and policies after fixed numbers of iterations and at convergence (e.g., after 1000 iterations).


    Submission

    Please upload the following files to Homework 2 (coding) on Brightspace: valueIterationAgents.py, qlearningAgents.py, and analysis.py.

    Please do not upload the files in a zip file or a directory as the autograder will not work if you do so.

    RL

    Introduction

    In this project, you will implement Q-learning. You will test your agents first on Gridworld (from class), then apply them to a simulated robot controller (Crawler) and Pacman. As in previous projects, this project includes an autograder for you to grade your solutions on your machine. This can be run on all questions with the command:

    python autograder.py

    It can be run for one particular question, such as q2, by:

    python autograder.py -q q2

    It can be run for one particular test by commands of the form:

    python autograder.py -t test_cases/q2/1-bridge-grid

    The code for this project contains the following files, available as a zip archive.

    Files you'll edit:
    • valueIterationAgents.py: A value iteration agent for solving known MDPs.
    • qlearningAgents.py: Q-learning agents for Gridworld, Crawler and Pacman.
    • analysis.py: A file to put your answers to questions given in the project.

    Files you should read but NOT edit:
    • mdp.py: Defines methods on general MDPs.
    • learningAgents.py: Defines the base classes ValueEstimationAgent and QLearningAgent, which your agents will extend.
    • util.py: Utilities, including util.Counter, which is particularly useful for Q-learners.
    • gridworld.py: The Gridworld implementation.
    • featureExtractors.py: Classes for extracting features on (state, action) pairs. Used for the approximate Q-learning agent (in qlearningAgents.py).

    Files you can ignore:
    • environment.py: Abstract class for general reinforcement learning environments. Used by gridworld.py.
    • graphicsGridworldDisplay.py: Gridworld graphical display.
    • graphicsUtils.py: Graphics utilities.
    • textGridworldDisplay.py: Plug-in for the Gridworld text interface.
    • crawler.py: The crawler code and test harness. You will run this but not edit it.
    • graphicsCrawlerDisplay.py: GUI for the crawler robot.
    • autograder.py: Project autograder.
    • testParser.py: Parses autograder test and solution files.
    • testClasses.py: General autograding test classes.
    • test_cases/: Directory containing the test cases for each question.
    • reinforcementTestClasses.py: Project 3 specific autograding test classes.

    Files to Edit and Submit: You will fill in portions of valueIterationAgents.py, qlearningAgents.py, and analysis.py during the assignment. Please do not change the other files in this distribution or submit any of our original files other than these files.

    Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. However, the correctness of your implementation -- not the autograder's judgements -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

    Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.

    Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours and review sessions are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask.


    Question 1: Q-Learning

    You will now write a Q-learning agent, which does very little on construction, but instead learns by trial and error from interactions with the environment through its update(state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in qlearningAgents.py, and you can select it with the option '-a q'. For this question, you must implement the update, computeValueFromQValues, getQValue, and computeActionFromQValues methods.

    Note: For computeActionFromQValues, you should break ties randomly for better behavior; the random.choice() function will help. In a particular state, actions that your agent hasn't seen before still have a Q-value, specifically a Q-value of zero, and if all of the actions that your agent has seen before have a negative Q-value, an unseen action may be optimal.

    Important: Make sure that in your computeValueFromQValues and computeActionFromQValues functions, you only access Q-values by calling getQValue. This abstraction will be useful for Question 5, when you override getQValue to use features of state-action pairs rather than state-action pairs directly.

    With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard:

    python gridworld.py -a q -k 5 -m

    Recall that -k controls the number of episodes your agent gets to learn. Watch how the agent learns about the state it was just in, not the one it moves to, and "leaves learning in its wake." Hint: to help with debugging, you can turn off noise by using the --noise 0.0 parameter (though this obviously makes Q-learning less interesting). If you manually steer the agent north and then east along the optimal path for four episodes, you should see the following Q-values: [figure: Gridworld Q-values after four manually steered episodes]

    Grading: We will run your Q-learning agent and check that it learns the same Q-values and policy as our reference implementation when each is presented with the same set of examples. To grade your implementation, run the autograder:

    python autograder.py -q q1
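
    For reference, here is a minimal sketch of the tabular Q-learning pieces described above (tie-breaking in computeActionFromQValues is omitted). It assumes you store Q-values in a util.Counter keyed by (state, action) pairs -- the attribute name self.qValues is a placeholder you would create in __init__ -- and that the skeleton exposes getLegalActions, self.alpha, and a discount attribute (self.discount is assumed here; check learningAgents.py for the exact name).

    def getQValue(self, state, action):
        return self.qValues[(state, action)]            # unseen pairs default to 0.0

    def computeValueFromQValues(self, state):
        actions = self.getLegalActions(state)
        if not actions:
            return 0.0                                  # terminal state: no future value
        return max(self.getQValue(state, a) for a in actions)

    def update(self, state, action, nextState, reward):
        # Sample-based update: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * V(s')]
        sample = reward + self.discount * self.computeValueFromQValues(nextState)
        self.qValues[(state, action)] = (
            (1 - self.alpha) * self.getQValue(state, action) + self.alpha * sample)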

    Question 2: Epsilon Greedy

    Complete your Q-learning agent by implementing epsilon-greedy action selection in getAction, meaning it chooses random actions an epsilon fraction of the time, and follows its current best Q-values otherwise. Note that choosing a random action may result in choosing the best action - that is, you should not choose a random sub-optimal action, but rather any random legal action. You can choose an element from a list uniformly at random by calling the random.choice function. You can simulate a binary variable with probability p of success by using util.flipCoin(p), which returns True with probability p and False with probability 1-p. After implementing the getAction method, observe the following behavior of the agent in gridworld (with epsilon = 0.3).

    python gridworld.py -a q -k 100

    Your final Q-values should resemble those of your value iteration agent, especially along well-traveled paths. However, your average returns will be lower than the Q-values predict because of the random actions and the initial learning phase. You can also observe the following simulations for different epsilon values. Does the behavior of the agent match what you expect?

    python gridworld.py -a q -k 100 --noise 0.0 -e 0.1
    python gridworld.py -a q -k 100 --noise 0.0 -e 0.9
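
    A minimal sketch of the epsilon-greedy selection rule described in this question, assuming util.flipCoin and the Question 1 helpers (random and util are assumed to be imported at the top of qlearningAgents.py):

    def getAction(self, state):
        actions = self.getLegalActions(state)
        if not actions:
            return None                                 # terminal state: nothing to do
        if util.flipCoin(self.epsilon):
            return random.choice(actions)               # explore: any random legal action
        return self.computeActionFromQValues(state)     # exploit: current best action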

    To test your implementation, run the autograder:

    python autograder.py -q q2

    With no additional code, you should now be able to run a Q-learning crawler robot:

    python crawler.py

    If this doesn't work, you've probably written some code too specific to the GridWorld problem and you should make it more general to all MDPs. This will invoke the crawling robot from class using your Q-learner. Play around with the various learning parameters to see how they affect the agent's policies and actions. Note that the step delay is a parameter of the simulation, whereas the learning rate and epsilon are parameters of your learning algorithm, and the discount factor is a property of the environment.


    Question 3: Bridge Crossing Revisited

    First, train a completely random Q-learner with the default learning rate on the noiseless BridgeGrid for 50 episodes and observe whether it finds the optimal policy.

    python gridworld.py -a q -k 50 -n 0 -g BridgeGrid -e 1

    Now try the same experiment with an epsilon of 0. Is there an epsilon and a learning rate for which it is highly likely (greater than 99%) that the optimal policy will be learned after 50 iterations? question8() in analysis.py should return EITHER a 2-item tuple of (epsilon, learning rate) OR the string 'NOT POSSIBLE' if there is none. Epsilon is controlled by -e, learning rate by -l. Note: Your response should not depend on the exact tie-breaking mechanism used to choose actions. This means your answer should be correct even if, for instance, we rotated the entire bridge grid world 90 degrees. To grade your answer, run the autograder:

    python autograder.py -q q3

    Question 4: Q-Learning and Pacman

    Time to play some Pacman! Pacman will play games in two phases. In the first phase, training, Pacman will begin to learn about the values of positions and actions. Because it takes a very long time to learn accurate Q-values even for tiny grids, Pacman's training games run in quiet mode by default, with no GUI (or console) display. Once Pacman's training is complete, he will enter testing mode. When testing, Pacman's self.epsilon and self.alpha will be set to 0.0, effectively stopping Q-learning and disabling exploration, in order to allow Pacman to exploit his learned policy. Test games are shown in the GUI by default. Without any code changes you should be able to run Q-learning Pacman for very tiny grids as follows:

    python pacman.py -p PacmanQAgent -x 2000 -n 2010 -l smallGrid

    Note that PacmanQAgent is already defined for you in terms of the QLearningAgent you've already written. PacmanQAgent is only different in that it has default learning parameters that are more effective for the Pacman problem (epsilon=0.05, alpha=0.2, gamma=0.8). You will receive full credit for this question if the command above works without exceptions and your agent wins at least 80% of the time. The autograder will run 100 test games after the 2000 training games.

    Hint: If your QLearningAgent works for gridworld.py and crawler.py but does not seem to be learning a good policy for Pacman on smallGrid, it may be because your getAction and/or computeActionFromQValues methods do not in some cases properly consider unseen actions. In particular, because unseen actions have by definition a Q-value of zero, if all of the actions that have been seen have negative Q-values, an unseen action may be optimal. Beware of the argmax function from util.Counter!

    Note: To grade your answer, run:

    python autograder.py -q q4

    Note: If you want to experiment with learning parameters, you can use the option -a, for example -a epsilon=0.1,alpha=0.3,gamma=0.7. These values will then be accessible as self.epsilon, self.gamma and self.alpha inside the agent.

    Note: While a total of 2010 games will be played, the first 2000 games will not be displayed because of the option -x 2000, which designates the first 2000 games for training (no output). Thus, you will only see Pacman play the last 10 of these games. The number of training games is also passed to your agent as the option numTraining.

    Note: If you want to watch 10 training games to see what's going on, use the command:

    python pacman.py -p PacmanQAgent -n 10 -l smallGrid -a numTraining=10

    During training, you will see output every 100 games with statistics about how Pacman is faring. Epsilon is positive during training, so Pacman will play poorly even after having learned a good policy: this is because he occasionally makes a random exploratory move into a ghost. As a benchmark, it should take between 1,000 and 1,400 games before Pacman's rewards for a 100-episode segment become positive, reflecting that he's started winning more than losing. By the end of training, the average should remain positive and be fairly high (between 100 and 350).

    Make sure you understand what is happening here: the MDP state is the exact board configuration facing Pacman, with the now complex transitions describing an entire ply of change to that state. The intermediate game configurations in which Pacman has moved but the ghosts have not yet replied are not MDP states, but are bundled into the transitions.

    Once Pacman is done training, he should win very reliably in test games (at least 90% of the time), since now he is exploiting his learned policy. However, you will find that training the same agent on the seemingly simple mediumGrid does not work well. In our implementation, Pacman's average training rewards remain negative throughout training. At test time, he plays badly, probably losing all of his test games. Training will also take a long time, despite its ineffectiveness. Pacman fails to win on larger layouts because each board configuration is a separate state with separate Q-values. He has no way to generalize that running into a ghost is bad for all positions. Obviously, this approach will not scale.


    Question 5: Approximate Q-Learning

    Implement an approximate Q-learning agent that learns weights for features of states, where many states might share the same features. Write your implementation in the ApproximateQAgent class in qlearningAgents.py, which is a subclass of PacmanQAgent.

    Note: Approximate Q-learning assumes the existence of a feature function f(s, a) over state-action pairs, which yields a vector of feature values f_1(s, a), ..., f_i(s, a), ..., f_n(s, a). We provide feature functions for you in featureExtractors.py. Feature vectors are util.Counter (like a dictionary) objects containing the non-zero pairs of features and values; all omitted features have value zero. The approximate Q-function takes the following form:

    Q(s, a) = sum_{i=1}^{n} f_i(s, a) * w_i

    where each weight w_i is associated with a particular feature f_i(s, a). In your code, you should implement the weight vector as a dictionary mapping features (which the feature extractors will return) to weight values. You will update your weight vectors similarly to how you updated Q-values:

    w_i <- w_i + alpha * difference * f_i(s, a),   where   difference = (r + gamma * max_{a'} Q(s', a')) - Q(s, a)

    Note that the difference term is the same as in ordinary Q-learning; r is the experienced reward.
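
    The following is a rough sketch of how these two equations translate into code. It assumes the skeleton exposes the chosen feature extractor as self.featExtractor (with a getFeatures(state, action) method returning a util.Counter), that you create self.weights as a util.Counter in __init__, and that the discount is available as self.discount; treat it as a sketch rather than a complete solution.

    def getQValue(self, state, action):
        # Q(s, a) = sum_i f_i(s, a) * w_i
        features = self.featExtractor.getFeatures(state, action)
        return sum(self.weights[f] * value for f, value in features.items())

    def update(self, state, action, nextState, reward):
        # difference = (r + gamma * max_a' Q(s', a')) - Q(s, a)
        difference = (reward
                      + self.discount * self.computeValueFromQValues(nextState)
                      - self.getQValue(state, action))
        features = self.featExtractor.getFeatures(state, action)
        for f, value in features.items():
            # w_i <- w_i + alpha * difference * f_i(s, a)
            self.weights[f] += self.alpha * difference * value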

    Autograding

    To get you familiarized with the autograder, we will ask you to code, test, and submit solutions for three questions. You can download all of the files associated with the autograder tutorial as a zip archive: tutorial.zip. Unzip this file and examine its contents:

    [ta@nova ~]$ unzip tutorial.zip
    [ta@nova ~]$ cd tutorial
    [ta@nova ~/tutorial]$ ls
    addition.py
    autograder.py
    buyLotsOfFruit.py
    grading.py
    projectParams.py
    shop.py
    shopSmart.py
    testClasses.py
    testParser.py
    test_cases
    tutorialTestClasses.py
    

    This contains a number of files you'll edit or run:

    • addition.py: source file for question 1
    • buyLotsOfFruit.py: source file for question 2
    • shop.py: source file for question 3
    • shopSmart.py: source file for question 3
    • autograder.py: autograding script (see below)

    and others you can ignore:

    • test_cases: directory contains the test cases for each question
    • grading.py: autograder code
    • testClasses.py: autograder code
    • tutorialTestClasses.py: test classes for this particular project
    • projectParams.py: project parameters

    The command python autograder.py grades your solution to all three problems. If we run it before editing any files we get a page or two of output:

    [ta@nova ~/tutorial]$ python autograder.py
    Starting on 1-21 at 23:39:51
    
    Question q1
    ===========
    *** FAIL: test_cases/q1/addition1.test
    *** 	add(a,b) must return the sum of a and b
    *** 	student result: "0"
    *** 	correct result: "2"
    *** FAIL: test_cases/q1/addition2.test
    *** 	add(a,b) must return the sum of a and b
    *** 	student result: "0"
    *** 	correct result: "5"
    *** FAIL: test_cases/q1/addition3.test
    *** 	add(a,b) must return the sum of a and b
    *** 	student result: "0"
    *** 	correct result: "7.9"
    *** Tests failed.
    
    ### Question q1: 0/1 ###
    
    
    Question q2
    ===========
    *** FAIL: test_cases/q2/food_price1.test
    *** 	buyLotsOfFruit must compute the correct cost of the order
    *** 	student result: "0.0"
    *** 	correct result: "12.25"
    *** FAIL: test_cases/q2/food_price2.test
    *** 	buyLotsOfFruit must compute the correct cost of the order
    *** 	student result: "0.0"
    *** 	correct result: "14.75"
    *** FAIL: test_cases/q2/food_price3.test
    *** 	buyLotsOfFruit must compute the correct cost of the order
    *** 	student result: "0.0"
    *** 	correct result: "6.4375"
    *** Tests failed.
    
    ### Question q2: 0/1 ###
    
    
    Question q3
    ===========
    Welcome to shop1 fruit shop
    Welcome to shop2 fruit shop
    *** FAIL: test_cases/q3/select_shop1.test
    *** 	shopSmart(order, shops) must select the cheapest shop
    *** 	student result: "None"
    *** 	correct result: ""
    Welcome to shop1 fruit shop
    Welcome to shop2 fruit shop
    *** FAIL: test_cases/q3/select_shop2.test
    *** 	shopSmart(order, shops) must select the cheapest shop
    *** 	student result: "None"
    *** 	correct result: ""
    Welcome to shop1 fruit shop
    Welcome to shop2 fruit shop
    Welcome to shop3 fruit shop
    *** FAIL: test_cases/q3/select_shop3.test
    *** 	shopSmart(order, shops) must select the cheapest shop
    *** 	student result: "None"
    *** 	correct result: ""
    *** Tests failed.
    
    ### Question q3: 0/1 ###
    
    
    Finished at 23:39:51
    
    Provisional grades
    ==================
    Question q1: 0/1
    Question q2: 0/1
    Question q3: 0/1
    ------------------
    Total: 0/3
    
    Your grades are NOT yet registered.  To register your grades, make sure
    to follow your instructor's guidelines to receive credit on your project.
    

    For each of the three questions, this shows the results of that question's tests, the question's grade, and a final summary at the end. Because you haven't yet solved the questions, all the tests fail. As you solve each question you may find some tests pass while others fail. When all tests pass for a question, you get full marks. Looking at the results for question 1, you can see that it has failed three tests with the error message "add(a,b) must return the sum of a and b". The answer your code gives is always 0, but the correct answer is different. We'll fix that in the next section.


    Question 1: Addition

    Open addition.py and look at the definition of add:

    def add(a, b):
        "Return the sum of a and b"
        "*** YOUR CODE HERE ***"
        return 0
    

    The tests called this with a and b set to different values, but the code always returned zero. Modify this definition to read:

    def add(a, b):
        "Return the sum of a and b"
        print("Passed a=%s and b=%s, returning a+b=%s" % (a,b,a+b))
        return a+b
    

    Now rerun the autograder (omitting the results for questions 2 and 3):

    [ta@nova ~/tutorial]$ python autograder.py -q q1
    Starting on 1-21 at 23:52:05
    
    Question q1
    ===========
    Passed a=1 and b=1, returning a+b=2
    *** PASS: test_cases/q1/addition1.test
    *** 	add(a,b) returns the sum of a and b
    Passed a=2 and b=3, returning a+b=5
    *** PASS: test_cases/q1/addition2.test
    *** 	add(a,b) returns the sum of a and b
    Passed a=10 and b=-2.1, returning a+b=7.9
    *** PASS: test_cases/q1/addition3.test
    *** 	add(a,b) returns the sum of a and b
    
    ### Question q1: 1/1 ###
    
    Finished at 23:41:01
    
    Provisional grades
    ==================
    Question q1: 1/1
    Question q2: 0/1
    Question q3: 0/1
    ------------------
    Total: 1/3
    
    

    You now pass all tests, getting full marks for question 1. Notice the new lines "Passed a=..." which appear before "*** PASS: ...". These are produced by the print statement in add. You can use print statements like that to output information useful for debugging.


    Question 2: buyLotsOfFruit function

    Add a buyLotsOfFruit(orderList) function to buyLotsOfFruit.py which takes a list of (fruit,pound) tuples and returns the cost of your list. If there is some fruit in the list which doesn't appear in fruitPrices it should print an error message and return None. Please do not change the fruitPrices variable. Run python autograder.py until question 2 passes all tests and you get full marks. Each test will confirm that buyLotsOfFruit(orderList) returns the correct answer given various possible inputs. For example, test_cases/q2/food_price1.test tests whether:

    Cost of [('apples', 2.0), ('pears', 3.0), ('limes', 4.0)] is 12.25
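
    One possible shape for the solution is sketched below. It assumes the fruitPrices dictionary already defined at the top of buyLotsOfFruit.py (which you should not change); the exact error message wording is up to you.

    def buyLotsOfFruit(orderList):
        """orderList: list of (fruit, numPounds) tuples. Returns the total cost."""
        totalCost = 0.0
        for fruit, numPounds in orderList:
            if fruit not in fruitPrices:
                print('Error: %s not found in fruitPrices' % fruit)   # message wording is illustrative
                return None
            totalCost += fruitPrices[fruit] * numPounds
        return totalCost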

    Question 3: shopSmart function

    Fill in the function shopSmart(orders,shops) in shopSmart.py, which takes an orderList (like the kind passed in to FruitShop.getPriceOfOrder) and a list of FruitShops, and returns the FruitShop where your order costs the least amount in total. Don't change the file name or variable names, please. Note that we will provide the shop.py implementation as a "support" file, so you don't need to submit yours. Run python autograder.py until question 3 passes all tests and you get full marks. Each test will confirm that shopSmart(orders,shops) returns the correct answer given various possible inputs. For example, with the following variable definitions:

    orders1 = [('apples',1.0), ('oranges',3.0)]
    orders2 = [('apples',3.0)]
    dir1 = {'apples': 2.0, 'oranges':1.0}
    shop1 =  shop.FruitShop('shop1',dir1)
    dir2 = {'apples': 1.0, 'oranges': 5.0}
    shop2 = shop.FruitShop('shop2',dir2)
    shops = [shop1, shop2]
    

    test_cases/q3/select_shop1.test tests whether:

    shopSmart.shopSmart(orders1, shops) == shop1

    and test_cases/q3/select_shop2.test tests whether:

    shopSmart.shopSmart(orders2, shops) == shop2
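
    A sketch of one way to write shopSmart, assuming each FruitShop provides the getPriceOfOrder(orderList) method shown in the Object Basics section below:

    def shopSmart(orderList, fruitShops):
        """Returns the FruitShop in fruitShops where orderList costs the least in total."""
        bestShop, bestCost = None, None
        for shop in fruitShops:
            cost = shop.getPriceOfOrder(orderList)
            if bestCost is None or cost < bestCost:
                bestShop, bestCost = shop, cost
        return bestShop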

    Object Basics

    Although this isn't a class in object-oriented programming, you'll have to use some objects in the programming projects, and so it's worth covering the basics of objects in Python. An object encapsulates data and provides functions for interacting with that data.

    Defining Classes

    Here's an example of defining a class named FruitShop:

    class FruitShop:
    
        def __init__(self, name, fruitPrices):
            """
                name: Name of the fruit shop
    
                fruitPrices: Dictionary with keys as fruit
                strings and prices for values e.g.
                {'apples':2.00, 'oranges': 1.50, 'pears': 1.75}
            """
            self.fruitPrices = fruitPrices
            self.name = name
            print('Welcome to %s fruit shop' % (name))
    
        def getCostPerPound(self, fruit):
            """
                fruit: Fruit string
            Returns cost of 'fruit', assuming 'fruit'
            is in our inventory or None otherwise
            """
            if fruit not in self.fruitPrices:
                return None
            return self.fruitPrices[fruit]
    
        def getPriceOfOrder(self, orderList):
            """
                orderList: List of (fruit, numPounds) tuples
    
            Returns cost of orderList, only including the values of
            fruits that this fruit shop has.
            """
            totalCost = 0.0
            for fruit, numPounds in orderList:
                costPerPound = self.getCostPerPound(fruit)
                if costPerPound != None:
                    totalCost += numPounds * costPerPound
            return totalCost
    
        def getName(self):
            return self.name
    

    The FruitShop class has some data, the name of the shop and the prices per pound of some fruit, and it provides functions, or methods, on this data. What advantage is there to wrapping this data in a class?

    1. Encapsulating the data prevents it from being altered or used inappropriately.
    2. The abstraction that objects provide makes it easier to write general-purpose code.

    Using Objects

    So how do we make an object and use it? Make sure you have the FruitShop implementation in shop.py. We then import the code from this file (making it accessible to other scripts) using import shop, since shop.py is the name of the file. Then, we can create FruitShop objects as follows:

    import shop
    
    shopName = 'the Berkeley Bowl'
    fruitPrices = {'apples': 1.00, 'oranges': 1.50, 'pears': 1.75}
    berkeleyShop = shop.FruitShop(shopName, fruitPrices)
    applePrice = berkeleyShop.getCostPerPound('apples')
    print(applePrice)
    print('Apples cost $%.2f at %s.' % (applePrice, shopName))
    
    otherName = 'the Stanford Mall'
    otherFruitPrices = {'kiwis':6.00, 'apples': 4.50, 'peaches': 8.75}
    otherFruitShop = shop.FruitShop(otherName, otherFruitPrices)
    otherPrice = otherFruitShop.getCostPerPound('apples')
    print(otherPrice)
    print('Apples cost $%.2f at %s.' % (otherPrice, otherName))
    print("My, that's expensive!")
    

    This code is in shopTest.py; you can run it like this:

    [ta@nova ~]$ python shopTest.py
    Welcome to the Berkeley Bowl fruit shop
    1.0
    Apples cost $1.00 at the Berkeley Bowl.
    Welcome to the Stanford Mall fruit shop
    4.5
    Apples cost $4.50 at the Stanford Mall.
    My, that's expensive!
    

    So what just happened? The import shop statement told Python to load all of the functions and classes in shop.py. The line berkeleyShop = shop.FruitShop(shopName, fruitPrices) constructs an instance of the FruitShop class defined in shop.py, by calling the __init__ function in that class. Note that we only passed two arguments in, while __init__ seems to take three arguments: (self, name, fruitPrices). The reason for this is that all methods in a class have self as the first argument. The self variable's value is automatically set to the object itself; when calling a method, you only supply the remaining arguments. The self variable contains all the data (name and fruitPrices) for the current specific instance (similar to this in Java). The print statements use the substitution operator (described in the Python docs if you're curious).

    Static vs Instance Variables

    The following example illustrates how to use static and instance variables in Python. Create a file named person_class.py containing the following code:

    class Person:
        population = 0
        def __init__(self, myAge):
            self.age = myAge
            Person.population += 1
        def get_population(self):
            return Person.population
        def get_age(self):
            return self.age
    

    We first run the script to make sure there are no errors (it produces no output, since it only defines the Person class):

    [ta@nova ~]$ python person_class.py

    Now use the class as follows:

    >>> import person_class
    >>> p1 = person_class.Person(12)
    >>> p1.get_population()
    1
    >>> p2 = person_class.Person(63)
    >>> p1.get_population()
    2
    >>> p2.get_population()
    2
    >>> p1.get_age()
    12
    >>> p2.get_age()
    63

    In the code above, age is an instance variable and population is a static variable. population is shared by all instances of the Person class whereas each instance has its own age variable.
