
Design and Optimization of Shared Mobility On Demand:

Dynamic Routing and Dynamic Pricing

by

Yue Guan

B.S.E. and B.A., Tsinghua University (2014)

S.M., Massachusetts Institute of Technology (2016)

Submitted to the Department of Mechanical Engineering

and Center for Computational Science and Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Mechanical Engineering and Computation

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2021

© Massachusetts Institute of Technology 2021. All rights reserved.

Author . . . .

Department of Mechanical Engineering

and Center for Computational Science and Engineering

November 30, 2020

Certified by . . . .

Anuradha M. Annaswamy

Senior Research Scientist of Mechanical Engineering

Thesis Supervisor

Accepted by . . . .

Nicolas G. Hadjiconstantinou

Professor of Mechanical Engineering

Chairman, Department Committee on Graduate Theses

Accepted by . . . .

Youssef M. Marzouk

Associate Professor of Aeronautics and Astronautics


Design and Optimization of Shared Mobility On Demand:

Dynamic Routing and Dynamic Pricing

by

Yue Guan

Submitted to the Department of Mechanical Engineering and Center for Computational Science and Engineering

on November 30, 2020, in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Mechanical Engineering and Computation

Abstract

Mobility of people and goods has been critical to urban life ever since cities emerged thousands of years ago. With the ushering in of Cyber-Physical Systems, enabled by the development of smart mobile devices, telecommunication technologies, and affordable, accessible, and powerful computing resources, new paradigms are revolutionizing urban mobility. Among these, Shared Mobility on Demand Service (SMoDS) has changed the landscape of urban transportation, providing alternatives with a customized combination of affordability, flexibility, and carbon footprint. Dynamic routing and dynamic pricing are two central pillars of an SMoDS solution, where the former offers customized routes according to the specific passenger request and real time traffic conditions, and the latter provides incentive signals that appropriately influence the passengers’ subscription to the service. Although emerging SMoDS solutions have seen remarkable successes, further improvements are needed. In this thesis, we present an integrated SMoDS design with dynamic routing and dynamic pricing that introduces two major improvements over the state of the art: (i) enhanced optimality in travel times through dynamic routing with added spatial flexibility, and (ii) explicit accommodation of behavioral modelling of empowered passengers so as to lead to an accurate dynamic pricing strategy.

The first part of this thesis focuses on the development of the dynamic routing framework with a new concept of space window. To accommodate the complexity introduced by space window in the optimization of dynamic routes, we propose an algorithm based upon the Alternating Minimization (AltMin) paradigm, and demonstrate an order of magnitude improvement in computational efficiency compared to benchmarks provided by standard solvers.

The second part of this thesis, related to dynamic pricing, is broken down into two modules, with the first related to behavioral modelling of empowered passengers based on Cumulative Prospect Theory (CPT). The CPT based behavioral model is able to capture the subjective and potentially irrational behaviors of passengers when deciding upon the SMoDS ride offer amidst uncertainties and risks associated with framing effects, loss aversion, diminishing sensitivity, and probability distortion. Key properties and the implications of the CPT based passenger behavioral model on dynamic pricing are discussed in detail. The second module of dynamic pricing determines the desired probability of acceptance from each passenger so as to optimize key performance indicators of the SMoDS such as the estimated waiting time. A Reinforcement Learning (RL) based approach combined with the problem formulation in the form of a Markov Decision Process (MDP) is used to estimate this desired probability of acceptance. The proposed RL algorithm deploys an integrated planning and learning architecture where the planning phase is carried out by a lookahead tree search, and the learning phase is achieved via value iteration using a neural network as the value function approximator. Two major challenges that arise in this context are the varying dimension of the underlying state and the arrival of information in a sequential manner where long-term dependency needs to be preserved. These are addressed through the incorporation of Long Short-Term Memory (LSTM), convolutional and fully-connected layers. Their judicious incorporation in the underlying neural network architecture allows the extraction of this information and successful estimation of the desired probability of acceptance that leads to the optimization of the SMoDS. A number of computational experiments are carried out using various datasets of large-scale problems and are shown to result in a superior capability of the proposed RL algorithm.

Thesis Supervisor: Anuradha M. Annaswamy


Acknowledgments

I would love to express my deepest and sincerest gratitude to my advisor, Dr. Anuradha M. Annaswamy, for her continuous support of my research and class work during the past four-plus years. I have learned a tremendous amount of knowledge and life lessons from her. She has granted me ample freedom to explore my own interests while providing the necessary guidance and inspiration in times of difficulties. Her standard of excellence has always helped push forward the boundaries, and those heated discussions with her to pursue the true north have been enlightening. I am extremely fortunate to have such an approachable and devoted mentor. I cannot thank her enough for everything she has done for me throughout this journey.

I would love to thank my doctoral committee members, Prof. George Barbastathis, Prof. Jonathan How, and Prof. Cathy Wu for spending time participating in my research and providing valuable feedback and critiques along the way.

I would love to acknowledge the generous financial support from Ford Motor Company through the Ford-MIT Alliance, which has been sponsoring the research discussed in this thesis since 2016. I am also grateful to all my collaborators from the Ford team: Eric H. Tseng, Hao Zhou, Eric Wingfield, and Ling Zhu, who have provided many constructive suggestions that have helped me approach these projects from different angles. It was a stimulating experience to work with them and I would not have been in the same position without their help.

I would also love to thank fellow members of the Active Adaptive Control Laboratory (AAC Lab): Max Zheng Qu and Jordan Romvary who introduced me to the group, Ben Jenkins, Damas Limoge, David Flamholz, and Seyed Mehran Dibaji with whom I have spent many late nights in the Nexus, Joseph Gaudio with whom I shared the ups and downs throughout the Ph.D. career, Vineet Jagadeesan Nair as well as our visiting students Han Zheng and Claudio Lombardi for their contributions to expanding the smart cities research field in the group, and Benjamin Thomsen, David D’Achiardi, Rabab Haider, Abhishek Patkar, Yohan John, Venkatesh Venkataramanan, Amir Farjadian, Stefanos Baros and many others for the wonderful time we shared together.

Many thanks to Tony Pulsone for helping me with expense reporting, conference room booking and other logistics issues. I also thank Leslie Regan, Una Sheehan, Kate Nelson, Britton Bradley and many others for their help with other administration related matters. I would also love to thank all the wonderful friends I met at or outside of MIT, across the globe. You have been so supportive and inspiring and without you guys I could not have gone so far. You have filled these years with colorful memories which will be cherished wherever I go.

This thesis is dedicated to my parents and grandparents. During the past six years, as a graduate student staying abroad, I have not been able to visit and spend time with you quite often. Thank you for your unconditional understanding, support, sacrifices and love ever since the beginning of time.

Last but not least, special thanks go to the healthcare workers and Task Force 2021 and Beyond at MIT. This year has been extremely tough - many people have lost their lives or jobs due to the pandemic and other crises. Though many things have been disrupted unexpectedly, I am grateful that I can stay safe and healthy, and am still able to pursue my academic and professional goals. Without the healthcare workers battling the pandemic on the front line and the careful designing, monitoring and revising of the protocols by the MIT Task Force 2021 and Beyond, this would not have been possible. I truly wish everyone will be fine and hope we can get through this soon, as one.


Contents

List of Figures
List of Tables

1 Introduction
1.1 Background and Motivation
1.2 Thesis Contributions
1.2.1 Dynamic Routing: Enhanced Optimality in Travel Times through Added Spatial Flexibility
1.2.2 Dynamic Pricing: Explicit Accommodation of Passenger Behavioral Modelling
1.3 Thesis Scope and Outline

2 Preliminaries
2.1 Mathematical Optimization
2.1.1 General Formulation
2.1.2 Taxonomy
2.2 Reinforcement Learning
2.2.1 Markov Decision Process
2.2.2 Value Iteration
2.3 Neural Networks for Sequences
2.3.2 Recurrent Neural Networks
2.3.3 Long Short-Term Memory Networks
2.3.4 Convolutional Neural Networks
2.3.5 Model Optimization

3 Dynamic Routing: Enhanced Spatial Flexibility Enabled by Space Window
3.1 Introduction
3.2 Problem Formulation
3.2.1 Preliminaries
3.2.2 Objective Function
3.2.3 Constraints
3.2.4 Master Problem
3.3 A Mixed Integer Quadratically Constrained Programming Formulation
3.4 An Alternating Minimization Algorithm
3.4.1 Phase 1: Optimize over S with R Fixed
3.4.2 Phase 2: Optimize over R with S Fixed
3.4.3 Convergence Analysis and Stopping Criterion
3.5 Computational Experiments
3.6 Extension to Dynamic Routing
3.7 Summary

4 Dynamic Pricing Part I: Cumulative Prospect Theory based Passenger Behavioral Modelling
4.1 Introduction
4.2 Preliminaries
4.2.1 Discrete Choice Model
4.2.2 Expected Utility Theory
4.3 CPT based Passenger Behavioral Model in SMoDS
4.3.1 Objective and Subjective Utilities
4.3.2 Interpretation of Risk Attitudes
4.3.3 Reference Points
4.3.4 Subjective Weighting of Probability Distributions
4.3.5 Key Properties of CPT based Behavioral Model
4.4 Implications of CPT using Computational Experiments
4.4.1 Determination of Parameters
4.4.2 Fourfold Pattern of Risk Attitudes
4.4.3 Strong Risk Aversion over Mixed Prospects
4.4.4 Self Reference
4.4.5 Remarks
4.5 Dynamic Price Design
4.6 Summary

5 Dynamic Pricing Part II: Reinforcement Learning the Desired Probability of Acceptance
5.1 Introduction
5.2 Preliminaries
5.3 Problem Formulation
5.3.1 State Space $\mathcal{S}$
5.3.2 Action Space $\mathcal{A}$
5.3.3 State Transition Function $\mathcal{P}$
5.3.4 Reward Function $\mathcal{R}$
5.3.5 Discount Factor $\gamma$
5.3.6 Value Function
5.4 Solving the MDP using DP and RL
5.4.1 Dynamic Programming for Small-scale Problems
5.5 Computational Experiments
5.5.1 Small-scale Problems
5.5.2 Medium-scale Problems
5.5.3 Large-scale Problems
5.5.4 Extension I: Huge-scale Problems
5.5.5 Extension II: Online Problems
5.6 Summary

6 Conclusions and Future Work
6.1 Summary of Results
6.2 Future Work

A Supplementary Materials: More on Value Network Design

B Supplementary Materials: More on Value Network Training
B.1 Techniques in Value Network Training
B.2 Summary of Hyperparameters

List of Figures

1-1 Operating procedure of a Shared Mobility on Demand Service: request, offer, decide, and operate.
1-2 Illustration of a transactive controller in a smart infrastructure.
1-3 Proposed SMoDS design with integrated dynamic routing and dynamic pricing.

2-1 Illustration of a feedforward, deep, and fully-connected neural network design.
2-2 Illustration of a recurrent neural network model.
2-3 Illustration of the repeating unit in an LSTM model.
2-4 Illustration of a 1-dimensional discrete convolution operation.

3-1 Spectrum of shared mobility services.
3-2 Complete door-to-door mode versus semi door-to-door mode.
3-3 Illustration of further improvement on travel times enabled by space window.
3-4 Illustration of scenarios where certain favorable sequences or routing points are ruled out due to clustering.
3-5 Illustration of the overall routes of four instances with the same set of requests and different problem settings.
3-6 Demonstrations of double walks in dynamic routing with space window.

4-1 Source of systematic uncertainty in the SMoDS.
4-2 Illustrations of $V(\cdot)$ and $\pi(\cdot)$ in the CPT framework.
4-3 Illustration of the fourfold pattern of risk attitudes in the SMoDS context.
4-4 Comparison of $p^s_{\bar{U}}$ and $p^o$.
4-5 Comparison of $p^s_{\bar{U}}$ with $p^s_{A^o}$ using four different $f_X(x)$.

5-1 Schematic of the MDP formulation.
5-2 Dimension of the state representation.
5-3 Illustration of state transition.
5-4 Lookahead tree that illustrates the exact DP algorithm.
5-5 Illustration of node split from passenger decision in response to the ride offer.
5-6 Schematic of H-DP($\tilde{N}$).
5-7 Schematic of the RL algorithm with an integrated planning and learning architecture.
5-8 Overall value network design.
5-9 More detailed illustration of the value network design with each layer labeled.
5-10 Zoomed-in illustration of the vehicle LSTM block.
5-11 Agent collects experiences through interacting with the environment.
5-12 Regulation of $\mathrm{EWT}(t)$ around $\mathrm{EWT}^*$ and the incurred expected acceptance rates for various $\mathrm{EWT}^*$ values.
5-13 Regulation of $\mathrm{EWT}(t)$ around a time-varying $\mathrm{EWT}^*$.
5-14 Evaluation of the RL algorithm on medium-scale problems.
5-15 Batch loss $L_i(w_i)$ versus training step $i$ in large-scale problems.
5-16 Average value of states in two held out sets with the training step.
5-17 Expected total discounted reward $\tilde{R}$ with respect to the number of lookahead steps $\tilde{N}$.
5-18 Incurred $\mathrm{EWT}(t)$ with respect to the time index, with $\tilde{N} = 1$, of a larger-scale problem.

A-1 Dimension of the value network design.
A-2 Architecture of one earlier value network design without convolutional layers.
A-3 Architecture of one earlier value network design with vanilla RNNs instead of LSTM blocks.
A-4 Architecture of one earlier value network design with fewer convolutional filters.

List of Tables

3.1 Dimensionality of the decision variables and constraints of the MIQCP formulation.
3.2 Dimensionality of the decision variables and constraints of the QCQP formulation for Phase 2.
3.3 Settings of the computational experiments to evaluate the AltMin dynamic routing algorithm.
3.4 Results of the computational experiments to evaluate the AltMin dynamic routing algorithm.

4.1 Numerical values of parameters used in computational experiments for evaluating the CPT based passenger behavioral model for the SMoDS.

5.1 Landscape of the computational experiments to demonstrate the capability of the proposed RL algorithm.
5.2 Results of the RL policy evaluated on another four datasets of large-scale problems.
5.3 Results of the RL policy evaluated on a huge-scale problem.
5.4 Results of the RL policy evaluated on an online problem.

A.1 Comparison of error on value function approximation of the final value network design and five previous designs.

Bibliographic Notes

This thesis is based on a number of publications, which are listed below.

[1] Anuradha M Annaswamy, Yue Guan, H Eric Tseng, Hao Zhou, Thao Phan, and Diana Yanakiev. Transactive control in smart cities. Proceedings of the IEEE, 106(4): 518–537, 2018.

[2] Yue Guan, Anuradha M Annaswamy, and H Eric Tseng. A dynamic routing framework for shared mobility services. ACM Transactions on Cyber-Physical Systems, 4(1): 1–28, 2019.

[3] Yue Guan, Anuradha M Annaswamy, and H Eric Tseng. Cumulative prospect theory based dynamic pricing for shared mobility on demand services. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 2239–2244. IEEE, 2019.

[4] Yue Guan, Anuradha M Annaswamy, and H Eric Tseng. Towards dynamic pricing for shared mobility on demand using Markov decision processes and dynamic programming. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2020.

[5] Vineet J Nair, Yue Guan, Anuradha M Annaswamy, and H Eric Tseng. Sensitivity analysis of passenger behavioral model for dynamic pricing of shared mobility on demand. In revision, 2020.

of passenger behavior model into shared mobility on demand services using reinforcement learning. To be submitted, 2020.


Chapter 1

Introduction

1.1 Background and Motivation

Mobility of people and goods has been critical to urban life ever since cities emerged thousands of years ago. Over one billion vehicles travel on the roads today, and that number is projected to double by 2050 [81]. New approaches and solutions are required to improve the quality of urban mobility, both on highways and on city streets. With the ushering in of Cyber-Physical Systems [95], enabled by the development of smart mobile devices, telecommunication technologies, and affordable, accessible, and powerful computing resources, new paradigms such as autonomous driving, connected vehicles, electrification, and shared mobility are emerging that have the potential to disrupt the way people travel.

Among these, shared mobility, or more formally, Shared Mobility on Demand Service (SMoDS), has been revolutionizing the ground transportation infrastructure in urban centers by providing timely and convenient transportation to anybody, anywhere, and anytime [15,44,51,105,125]. Until recently, available solutions for urban transportation have been clearly binary. The first option is conventional public transportation, e.g., buses and subways, which provides low cost and a reduced carbon footprint per traveler at the expense of increased walking, waiting, and riding times. The second is individual private automobiles, e.g., taxis and driving, which reduce the travel times of individuals but at a significantly higher cost and with an increased carbon footprint. The emergence of SMoDS platforms such as Uber [3], Lyft [1], Didi Chuxing [8], and Grab [6] has changed this landscape, providing alternatives with a customized combination of affordability, flexibility, and carbon footprint.

SMoDS is able to provide passengers with a reliable mode of transportation that is catered to the individual, and at the same time enhances the utilization of the underlying resources. This is enabled by the two central ingredients of SMoDS, dynamic routing [62,136] and dynamic pricing [90,114]. Dynamic routing responds to real time travel requests from passengers with customized routes according to the specific request and real time traffic conditions, and dynamic pricing provides the incentive signal that suitably influences the decision of the empowered passengers on whether to accept the ride offer or reject it and turn to an alternative transportation option instead. With these, SMoDS typically operates in a four-step procedure illustrated in Fig. 1-1: (i) Request: passengers request the shared ride service with specified pickup/drop-off locations, maximum distances they are willing to walk, and other requirements, e.g., time window of service, if needed; (ii) Offer: the SMoDS server distributes ride offers to passengers, each consisting of a customized route, including pickup and drop-off locations, walking distance, time of pickup and drop-off, and price; (iii) Decide: passengers decide whether to accept or decline the ride offer according to the specifications provided; and (iv) Operate: the SMoDS server sends out operational instructions to the vehicles according to the decisions from the passengers, and then the trip starts or is updated.
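As a reading aid, the four-step procedure can be sketched as a small data model. This is a minimal illustration only: the class and function names (`RideRequest`, `RideOffer`, `offer_respects_request`, `operate`) and the encoding of times as minutes after midnight are assumptions made here, not part of the SMoDS design. The numerical values are the ones shown in Fig. 1-1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RideRequest:
    """Step (i): what the passenger sends (hypothetical field names)."""
    passenger_id: str
    origin: str
    destination: str
    max_walk_mi: float
    pickup_window: Tuple[int, int]  # (earliest, latest), minutes after midnight


@dataclass
class RideOffer:
    """Step (ii): what the server returns (hypothetical field names)."""
    passenger_id: str
    pickup_point: str
    dropoff_point: str
    walk_to_pickup_mi: float
    walk_from_dropoff_mi: float
    pickup_time: int   # minutes after midnight
    dropoff_time: int  # minutes after midnight
    price: float


def offer_respects_request(req: RideRequest, offer: RideOffer) -> bool:
    """Check that an offer honors the walking limit and the pickup window."""
    earliest, latest = req.pickup_window
    return (
        offer.passenger_id == req.passenger_id
        and offer.walk_to_pickup_mi <= req.max_walk_mi
        and offer.walk_from_dropoff_mi <= req.max_walk_mi
        and earliest <= offer.pickup_time <= latest
    )


def operate(offer: RideOffer, accepted: bool, shuttle_id: str) -> Optional[dict]:
    """Step (iv): emit a vehicle instruction only if the offer was accepted."""
    if not accepted:
        return None
    return {
        "shuttle_id": shuttle_id,
        "passenger_id": offer.passenger_id,
        "stops": [(offer.pickup_point, offer.pickup_time),
                  (offer.dropoff_point, offer.dropoff_time)],
    }


# Values taken from the example in Fig. 1-1.
request = RideRequest("P007", "Stata", "The Q", 0.2, (17 * 60 + 30, 17 * 60 + 40))
offer = RideOffer("P007", "BLVD 35", "AMC", 0.2, 0.2,
                  17 * 60 + 35, 17 * 60 + 50, 2.0)
accepted = True  # step (iii): the passenger's decision in Fig. 1-1
instruction = operate(offer, accepted, "S001")
```

The decision in step (iii) is entered by hand here; how passengers actually reach it is the subject of the behavioral modelling discussed later in this chapter.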

Research has revealed the potential of SMoDS for a tremendous positive impact on personal mobility, pollution, congestion, energy consumption, and thereby quality of life [4,16,87,103,149]. Ref. [2] illustrates that the idle rate of vehicles in the United States and the United Kingdom is at the level of 95%, and that this number can be reduced greatly if SMoDS is widely adopted. A recent article [14] demonstrates that if SMoDS is implemented in New York City, 98% of the taxi demand could be satisfied by 3,000 four-passenger vehicles, which is fewer than 25% of the number of taxis currently operating in the city, with a waiting time of 2.7 minutes on average. Furthermore, over 95% of the trips can feasibly be shared with other trips given a maximum 5-minute travel delay [122]. Similar empirical laws appear to be applicable in other cities as well, suggesting that the potential of SMoDS solutions is a global one [133]. These estimates do not even consider the cost of other potential negative externalities such as vehicular emissions (greenhouse gas emissions and particulate matter) [112], travel-time uncertainty [38], and a higher propensity for accidents [75].

[Figure 1-1 panels: 1. Request (Passenger ID: P007; From: Stata; To: The Q; Willing to walk: 0.2 mi; Pickup Window: 5:30 - 5:40 pm). 2. Offer, from the Shared Mobility on Demand Server (From: BLVD 35, 0.2 mi to Stata; To: AMC, 0.2 mi to The Q; Pickup: 5:35 pm; Drop-off: 5:50 pm; Price: $2). 3. Decide (Decision: accept). 4. Operate (Shuttle ID: S001; Passenger ID: P007; Location: BLVD 35 & AMC; Time: 5:35 and 5:50 pm).]

Figure 1-1. Operating procedure of a Shared Mobility on Demand Service: request, offer, decide, and operate.

Although the emerging SMoDS solutions have achieved remarkable successes, there is still considerable room for improvement.

The dynamic routing schemes in the existing literature [14,18,99,122,130,133] mostly utilize the complete door-to-door mode, where the vehicle picks up and drops off passengers exactly at the requested locations. To avoid long travel delays, the capacity of shared trips is therefore kept very limited. As a result, the utilization of the infrastructure is still limited and the shared rides are fairly expensive for daily commuters. Refs. [13,42,50] propose the concept of meeting points, where the vehicles direct the passengers to intermediate locations other than the requested ones for pickup. This partially pushes the boundaries forward; however, the proposed solution is not dynamic in real time and passengers have no control over the actual walking distance incurred. We further fill in the gap by proposing a novel concept termed Space Window, which leverages passengers’ willingness to walk. Space window not only introduces more degrees of freedom for dynamic routing to further optimize travel times, but also preserves the flexibility for passengers to tune the service quality along the spectrum of public and personalized transportation, and therefore introduces a continuum of solutions at various levels of cost, flexibility, and carbon footprint.
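To make the idea of a space window concrete, the sketch below treats each window as a disk around the requested location whose radius is the passenger's maximum walking distance, and looks for candidate stops that serve two requests at once. The disk shape, the straight-line (Euclidean) distance, the coordinates in meters, and all function names are simplifying assumptions for illustration; the formulation actually used in this thesis is given in Chapter 3.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def within_space_window(center: Point, radius: float, point: Point) -> bool:
    """True if `point` is within walking radius of the requested location."""
    return math.dist(center, point) <= radius


def feasible_points(center: Point, radius: float,
                    candidates: List[Point]) -> List[Point]:
    """Candidate stops that fall inside one passenger's space window."""
    return [p for p in candidates if within_space_window(center, radius, p)]


def windows_overlap(c1: Point, r1: float, c2: Point, r2: float) -> bool:
    """Two disk-shaped windows intersect if their centers are close enough."""
    return math.dist(c1, c2) <= r1 + r2


# Hypothetical data: two requests 400 m apart, each willing to walk 300 m.
request_a, walk_a = (0.0, 0.0), 300.0
request_b, walk_b = (400.0, 0.0), 300.0
candidate_stops = [(200.0, 0.0), (-100.0, 0.0), (600.0, 100.0)]

# Stops inside both windows can serve both passengers with a single vehicle stop.
shared_stops = [p for p in feasible_points(request_a, walk_a, candidate_stops)
                if within_space_window(request_b, walk_b, p)]
```

With these hypothetical numbers, only the stop midway between the two requests lies in both windows, so both pickups could be clustered there.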

The dynamic pricing strategy in the SMoDS context in fact falls into a broader notion of Transactive Control [16,71,73], which corresponds to the control mechanism enabled through economic contracts to incentivize and enable flexible consumption from empowered customers. More generally, the empowerment of customers and the treatment of dynamic prices as incentive signals also belong to human-in-the-loop modelling, where human beings interact with the infrastructure, not only as the receivers of the resources, but also as actuators that contribute to the realization of the system objective [47,57,151]. The concept of transactive control dates back at least 40 years to the late 1970s in the context of smart grids [123,124]. The design of a transactive controller in a smart infrastructure typically consists of an incentive signal sent to the empowered consumer from the infrastructure and a feedback signal received from the consumer, and together the goal is to ensure that the underlying resources are optimally utilized. This implies that the overall process introduces a feedback loop, where empowered consumers serve as actuators into an infrastructure, and transactive control represents a feedback control design that ensures that the goals of the infrastructure are realized (Fig. 1-2).

In the SMoDS context, passengers are empowered in the sense that they have the choice to either accept the ride offer from the server or reject it and choose an alternative transportation option instead. The transactive controller therefore provides the dynamic price as the incentive signal that nudges the passenger to return an appropriate feedback signal, one that optimizes the SMoDS system performance. The critical component in the development of a transactive controller is the sociotechnical model [25,37,58], which combines the behavioral model of empowered customers and the technical module that captures the response from the infrastructure.

[Figure 1-2 block labels: Transactive Controller; Empowered Customers; Infrastructure; Incentive Signal; Feedback Signal; Actuators; Plant; Sociotechnical Model.]

Figure 1-2. Illustration of a transactive controller in a smart infrastructure. A mechanism through which system- and component-level decisions are made through economic transactions between the components of the system, in conjunction with or in lieu of traditional controls.

For the former, Expected Utility Theory (EUT) [140] has been widely used [21,118,129], which assumes that passengers are rational. However, this assumption hardly holds in the SMoDS context, since uncertainty in travel times, and hence in the utility of taking the service, is a central feature of the SMoDS due to the dynamic accommodation of new requests at any time during the trip. Research has revealed that people tend to behave in an irrational fashion when exposed to uncertainty or risk, so the use of EUT under this circumstance is inadequate [84,138]. Even more simplified behavioral models have also been used, e.g., passengers accept ride offers if the price is below a threshold and reject them otherwise, which is clearly unsatisfactory [96]. Passenger decisions influence the SMoDS system performance, and hence it is crucial to understand more carefully how these decisions are made. We incorporate Prospect Theory, a Nobel Prize-winning modelling framework [84,138], to fill in this gap.
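Two of the effects that Prospect Theory adds on top of EUT, loss aversion and probability distortion, can be seen in a short numerical sketch. The functional forms below are the standard ones from Tversky and Kahneman's 1992 formulation of Cumulative Prospect Theory, and the default parameter values (0.88, 2.25, 0.61) are their published estimates for monetary gambles. Both are used here purely as an assumption for illustration; the model and parameters adopted in this thesis are specified in Chapter 4. The function names are hypothetical.

```python
def value(x: float, alpha: float = 0.88, beta: float = 0.88,
          lam: float = 2.25) -> float:
    """Subjective value of an outcome x measured relative to a reference point.

    Gains (x >= 0) are concave in x; losses are convex and scaled up by the
    loss-aversion coefficient lam > 1.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)


def weight(p: float, gamma: float = 0.61) -> float:
    """Inverse-S probability weighting: overweights small p, underweights large p."""
    numerator = p ** gamma
    denominator = (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)
    return numerator / denominator


# Loss aversion: a loss looms larger than a gain of the same size.
gain, loss = value(10.0), value(-10.0)

# Probability distortion: rare events feel likelier than they are,
# near-certain events feel less certain than they are.
rare, near_certain = weight(0.01), weight(0.99)
```

Under EUT, by contrast, the probability weighting would be the identity and gains and losses of equal size would be valued symmetrically, so neither effect would appear.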

in the existing literature [33,36,91]: (i) Economists apply empirical methodologies to investigate the correlation between price surge and various system objectives including average waiting time and idle rate [40,45,91,100]. The major drawback of this approach is that it does not provide a rigorous design rule for a dynamic pricing strategy that is guaranteed to enhance the system performance, but rather relies on empirical trial and error. (ii) Queueing-theoretic models have also been explored, where the fleet and passenger requests are modeled as two queues which are matched by dynamic pricing [21,22]. The major drawback of this approach is that it is not able to accommodate the complex spatiotemporal dynamics of vehicles with multiple passengers sharing the trip, but rather models the vehicles with a binary status of being either available or occupied. In addition, it ignores the long-term dependency, which is an important feature in the SMoDS context and hence should be preserved. (iii) The use of Markov Decision Process (MDP). This approach is along the lines of [41,96,118]. These papers, however, use MDP to directly model dynamic prices. Such an approach can become inadequate as it attempts to capture both the design elements of the vehicle operations and the stochastic behavioral elements of the empowered passengers who are influenced by these prices. Instead, our proposed approach decouples these two elements and focuses completely on the former.

Furthermore, the algorithms developed in [41,96,118] are oversimplified and not powerful enough to tackle the sophisticated MDPs. The SMoDS has a number of challenges that make the application of MDP based approaches far from straightforward. The two biggest challenges are the varying dimension of the underlying state and the sequential availability of information where long-term dependency needs to be preserved. We overcome these challenges by incorporating Reinforcement Learning (RL) with a neural network designed as the value function approximator, whose architecture directly addresses these features. Such an application of RL with a neural network has not been carried out before. Existing research falls short in one way or another. Most of it does not attempt to represent the full state information: it considers either very small-scale problems, where a tabular representation suffices, or only local information, where a representation truncated to a fixed dimension is utilized. These approaches have limited applications since the former cannot be scaled up and the latter is myopic due to the lack of global information of the SMoDS platform [41]. In addition, several recent works that attempt to represent the state all focus on designing a set of hand-crafted features and utilizing a simple structure, e.g., a linear or piecewise linear combination of these features, to approximate the optimal value function [96,118]. Though this approach is able to handle the first challenge of time-varying dimension by fixing the number of features, it suffers from two major drawbacks that prevent it from resolving the second challenge: (i) it relies heavily on the quality of the features, whose quantity is typically limited, so it risks losing essential information and struggles to extract the rich hidden information embedded in the state; and (ii) the approximator structure is too simplified, which prevents it from providing sufficient degrees of freedom to approximate the value function, which is highly nonlinear and complex in the SMoDS context [83,148]. To fill in this gap, the proposed neural network design has a specialized architecture integrating Long Short-Term Memory (LSTM) [60,77,146], convolutional [86,92,94] and fully-connected [61,80,121] layers that enables the RL agent to approximate the state value with good accuracy and hence take appropriate actions so as to optimize the long-term SMoDS system performance.

1.2

Thesis Contributions

The overall contribution of this thesis is an integrated design of a Shared Mobility on Demand Service with dynamic routing and dynamic pricing, which responds to real time travel needs from passengers who are suitably incentivized to subscribe to the service such that the system performance is optimized (Fig. 1-3). The proposed SMoDS solution achieves the following improvements over the state of the art: (i) enhanced optimality in travel times through added spatial flexibility in dynamic routing, and (ii) explicit accommodation of passenger behavioral modelling so as to lead to an accurate dynamic pricing strategy.


[Figure 1-3: block diagram. Labels: SMoDS Design; Dynamic Routing; Space Window + AltMin; Dynamic Pricing; Behavioral Modeling via CPT; Inverse Behavioral Model; Optimize System Performance via RL; Passenger Probability of Acceptance Influences System Performance.]

Figure 1-3. Proposed SMoDS design with integrated dynamic routing and dynamic pricing. Dynamic routing with space window solved via the AltMin optimization algorithm is introduced in Chapter 3. Dynamic pricing is further broken down into two parts, with Part I corresponding to the CPT based passenger behavioral modelling and Part II corresponding to solving the desired probability of acceptance from each empowered passenger that leads to the optimization of the long-term SMoDS system performance, which are elaborated in Chapters 4 and 5, respectively.

1.2.1

Dynamic Routing: Enhanced Optimality in Travel Times through Added Spatial Flexibility

The first major contribution of this thesis is the enhanced optimality in travel times in dynamic routing, enabled by a concept we have developed, termed space window.

To fill the existing gap in the complete door-to-door mode, we have proposed the concept of space window, where passengers are assumed to be willing to walk a short distance both before being picked up and after being dropped off. Space window provides room for further improvement on travel times over the state of the art in the following three aspects: (i) added spatial flexibility in routing, since pickup/drop-off points expand to pickup/drop-off regions which contain multiple feasible locations for service; (ii) clustering of several pickup and drop-off events to the same routing point if the corresponding space windows share a common intersection, thereby reducing the number of vehicle stops; and (iii) leveraging the asymmetry of travel times between pedestrians and vehicles considering real street topologies such as one-ways, pedestrian-only lanes or jammed streets, such that a short walk for the pedestrian may save a long detour for the vehicle. In addition, space window enables passengers to retain the flexibility to tune the service quality along the spectrum by specifying their own willingness to walk. The degrees of freedom introduced by space window enable further optimization of the travel times, and therefore provide a continuum of solutions at various levels of cost, flexibility, and carbon footprint.

However, with the added degrees of freedom provided by space window, the actual computation of the dynamic routes is much more burdensome since: (i) more decision variables, e.g., routing points and clustering patterns, are introduced, and (ii) the optimization problem becomes a Mixed Integer Programming, which is typically harder to solve than the Integer Programming that corresponds to dynamic routing without space window. To cope with such difficulties, we have developed an Alternating Minimization (AltMin) algorithm that is able to effectively solve dynamic routing with space window. Computational experiments with various settings using real operational data have been conducted, demonstrating that the AltMin algorithm is able to outperform state of the art optimization solvers, e.g., Gurobi [7], with an order of magnitude improvement in computational efficiency while maintaining minimal optimality gaps [16,67].

1.2.2

Dynamic Pricing: Explicit Accommodation of Passenger Behavioral Modelling

The second major contribution of this thesis is the explicit accommodation of behavioral modelling of empowered passengers through Cumulative Prospect Theory and a judicious application of Reinforcement Learning, which, to the best of our knowledge, has never been reported in the existing SMoDS literature.

We break down the dynamic pricing strategy into two modules as illustrated in Fig. 1-3: (i) the first refers to behavioral modelling of the empowered passengers, which essentially estimates the mapping $p = f(\mu)$ from the dynamic price $\mu$ to the probability $p$ with which the passenger accepts the ride offer; and (ii) the second corresponds to the derivation of the desired probability of acceptance $p^*$ that optimizes the long-term SMoDS system performance. The price $\mu^*$ that nudges the passenger towards $p^*$ can then be derived simply by setting $p = p^*$ and solving the inverse behavioral model $\mu^* = f^{-1}(p^*)$. In fact, these two modules constitute the sociotechnical model for the SMoDS as indicated in Fig. 1-2, with the first module corresponding to mapping the incentive signal to the feedback signal through empowered customers, and the second corresponding to deriving the desired feedback signal that achieves the desired goals of the infrastructure¹.
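As a minimal illustration of the inversion step $\mu^* = f^{-1}(p^*)$, the sketch below solves $f(\mu) = p^*$ numerically by bisection. It is not the CPT based model of this thesis: the logistic curve `acceptance_probability`, its parameters, and the price bracket are hypothetical stand-ins, and the method assumes only that $f$ is continuous and strictly decreasing in the price.

```python
import math
from typing import Callable

def acceptance_probability(price: float) -> float:
    """Hypothetical stand-in for p = f(mu): a logistic curve, strictly decreasing in price."""
    return 1.0 / (1.0 + math.exp(0.8 * (price - 10.0)))

def invert_price(f: Callable[[float], float], p_target: float,
                 lo: float, hi: float, tol: float = 1e-9) -> float:
    """Solve f(mu) = p_target by bisection, assuming f is continuous and strictly decreasing on [lo, hi]."""
    if not (f(hi) <= p_target <= f(lo)):
        raise ValueError("p_target is not attainable on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > p_target:
            lo = mid  # acceptance still too high: the price must be larger
        else:
            hi = mid  # acceptance too low (or exact): the price must be smaller
    return 0.5 * (lo + hi)

# Price that nudges the (hypothetical) passenger towards a 30% probability of acceptance
mu_star = invert_price(acceptance_probability, p_target=0.3, lo=0.0, hi=30.0)
```

Bisection is used here because it needs nothing beyond monotonicity; if a closed-form inverse of the behavioral model is available, it would replace the numerical search.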

For the first module, we have incorporated Prospect Theory (PT), in particular Cumulative Prospect Theory (CPT), for the behavioral modelling of empowered passengers, which captures their subjective decision-making process. PT [84,138] is a Nobel Prize-winning theory that models subjective decision-making of human beings under risk or uncertainty, and has achieved remarkable successes in behavioral economics [24] and cognitive psychology [17]. In the context of transportation, PT has been explored in [23,72,147]. However, to the best of our knowledge, no prior work has been reported on the applications of PT in the SMoDS or for evaluating dynamic prices. We are the first to incorporate PT in quantitative passenger behavioral modeling, in the analysis of its impact on the risk attitudes of passengers, and in the implications for dynamic pricing for the SMoDS, which fills the gap in the existing literature.

The second module pertains to the determination of the desired probability of acceptance from each empowered passenger through the incorporation of RL such that the long-term SMoDS system performance is optimized. There are three innovations in the context of such an application of RL: (i) problem formulation; (ii) algorithm development; and (iii) demonstration with a large-scale numerical study. Each of these innovations is expanded on as follows.

Problem formulation: To fill the existing gap, we model the passenger probability of acceptance directly, which is the direct control variable of the MDP, instead of the price, which is the incentive for this probability. The actual price is then derived using the inverse passenger behavioral model detailed above. This formulation enables the separation of the CPT based passenger behavioral model from the MDP framework, making it more tractable. In addition, none of the existing works discussed above apply CPT in passenger behavioral modelling, but rather deploy oversimplified models that are not sufficient in the SMoDS context. The benefit of using the MDP to determine the desired probability of acceptance is its flexibility. It allows the consideration of a variety of system objectives, where the performance metric could be chosen from a wide range including the regulation of estimated waiting time, maximization of revenue and ridership, etc. Overall, such a problem formulation, which breaks down the overall problem into a modular form, is new and an innovation we have introduced in this thesis.

Algorithm development: The next step is to develop an algorithm that is able to effectively tackle the MDP. The use of Dynamic Programming (DP) becomes inadequate as the size of the problem, i.e., the number of active passengers in the SMoDS platform, increases. We propose the use of RL, as an approximation to DP, to derive the optimal policy of the MDP. The algorithm enables the SMoDS server to learn to design the desired probability of acceptance in such a way that the long-term SMoDS system performance is enhanced. The proposed RL algorithm fills the gap in existing MDP based approaches by deploying a neural network as the value function approximator, with a specialized architecture which integrates LSTM, convolutional and fully-connected layers. The proposed value network design resolves the challenges discussed in Section 1.1 as follows: (i) weight sharing between LSTM units enables it to handle input with a time-varying dimension using a fixed number of weights, and (ii) the inherent recurrent structure enables LSTMs to extract insights from sequential information, the cell state enables the preservation of long-term dependency, and the three gates enable the network to tune the range of memory to utilize. In addition, the incorporation of convolutional and fully-connected layers enables integrated regression from the state representation to the optimal value function with good accuracy. To the best of our knowledge, designing a neural network that accurately represents the state and approximates the complex value function in the SMoDS context in such an end-to-end fashion is entirely new. We have demonstrated the capability of the proposed value network via: (a) a 2.47% error on value function approximation on medium-scale problems benchmarked against ground truth, and (b) improved optimality on expected total discounted reward validated through various datasets of large-scale problems compared with the baseline.

Demonstration with computational experiments: We have been able to demonstrate the effectiveness of the proposed RL algorithm in tackling large-scale MDPs with a number of computational experiments, even with the number of passengers increasing up to 50, with which the results are promising from a practical operations perspective [10]. Moreover, these results in the SMoDS context illustrate that in a human-in-the-loop model, through actively and appropriately leveraging the customer empowerment as degrees of freedom, the objective of the underlying sociotechnical system could be realized at various desired levels [66,68,69,109].

1.3

Thesis Scope and Outline

This thesis presents an integrated Shared Mobility on Demand Service design with dynamic routing and dynamic pricing as illustrated in Fig. 1-3. The dynamic routing framework deploys a novel concept that we have developed, termed space window, which enhances spatial flexibility and therefore enables further improvement on travel times over the state of the art. The dynamic pricing strategy fills the gap in the existing literature by explicitly accommodating subjective decision-making of empowered passengers through the incorporation of Cumulative Prospect Theory in behavioral modelling and a judicious application of Reinforcement Learning, thereby designing prices through which passengers are appropriately incentivized such that the incurred long-term SMoDS system performance is optimized.

• Chapter 2 provides the preliminaries for the development of this thesis work. These correspond to the tools that are universal to our methodologies, including the fundamentals of mathematical optimization, reinforcement learning and neural network models specialized in dealing with sequences.

• Chapter 3 presents the proposed dynamic routing framework with added spatial flexibility enabled by the novel concept of space window, and the Alternating Minimization algorithm that has been developed to tackle the optimization of the dynamic routes. Computational experiments are included that demonstrate the performance of the AltMin algorithm in terms of both computational efficiency and optimality, benchmarked against solutions obtained by Gurobi, the standard optimization solver. These contents correspond to the first building block in Fig. 1-3.

• Chapter 4 focuses on the first part of the dynamic pricing strategy, which refers to the CPT based passenger behavioral model in the SMoDS context. The fundamentals of CPT, development of the pipeline in applying CPT in behavioral modelling, key properties of the CPT based behavioral model, and three implications on dynamic pricing for the SMoDS are all presented. This chapter is reflected as the second building block in Fig. 1-3.

• Chapter 5 presents the second part of dynamic pricing, which pertains to appropriately leveraging passenger empowerment, i.e., the desired probability of acceptance in the SMoDS context, such that the incurred long-term system performance is optimized. The formulation of the Markov Decision Process as the environment, the development of the Reinforcement Learning algorithm, especially the neural network designed for value function approximation, and a number of computational experiments that validate the effectiveness of RL are all addressed. These contents correspond to the third and last building block in Fig. 1-3.

• Chapter 6 concludes this thesis with a brief summary and future work that will be explored to put these contributions into practice.


Chapter 2

Preliminaries

In this chapter we present the preliminaries behind the main methodologies of this thesis work. We first introduce the basics of mathematical optimization in Section 2.1, which is the fundamental mechanism that we apply in dynamic routing. Next in Section 2.2, we present the fundamentals of Reinforcement Learning, the paradigm with which an agent learns to take actions such that expected cumulative reward is maximized, serving as an effective enabler for the dynamic pricing strategy that we have developed. Finally, we describe the theories for constructing and training neural network models that are specialized in dealing with sequences in Section 2.3, which we have incorporated in the development of the RL algorithm.

2.1

Mathematical Optimization

2.1.1

General Formulation

Mathematical optimization, or mathematical programming, is quite a universal modelling framework where an objective described by a function $f(x)$ is to be minimized¹ by picking appropriate values for a set of decision variables $x \in \mathbb{R}^N$ that satisfy a finite set of constraints $h_i(x) \le 0$, $\forall i \in [M]$, where $N, M \in \mathbb{Z}_{>0}$ and $[M] \triangleq \{1, 2, \cdots, M\}$². Formally,

$$\min_{x} \; f(x) \tag{2.1}$$

$$\text{subject to} \;\; h_i(x) \le 0, \quad \forall i \in [M] \tag{2.2}$$

The properties of $f(x)$ and $h_i(x)$ as well as the support of $x$ define the characteristics of the optimization problem that is to be solved [32].

2.1.2

Taxonomy

Now we provide a taxonomy of the optimization problems defined via Eqs. (2.1) through (2.2). Note that this taxonomy is not exhaustive but rather aims at providing a brief introduction to the classes of optimization problems that will be encountered in this thesis. If both $f(x)$ and $h_i(x)$ are linear in $x$, i.e., in the form of $f(x) = c^T x$ and $h_i(x) = a_i^T x$ where $c, a_i \in \mathbb{R}^N$, $\forall i \in [M]$, then the corresponding problem is termed linear programming, otherwise nonlinear programming. If all elements of $x$ are constrained to take only integer values, the problem is an Integer Programming, and if some but not all elements of $x$ have to be integers, it is a Mixed Integer Programming (MIP). For MIPs, if the objective function is quadratic in $x$ in the form of $f(x) = x^T Q x + q^T x$ where $Q \in \mathbb{R}^{N \times N}$ and $q \in \mathbb{R}^N$ while $h_i(x)$ are linear in $x$, $\forall i \in [M]$, the corresponding optimization problem is a Mixed Integer Quadratic Programming (MIQP). If each constraint is quadratic as well such that $h_i(x) = x^T Q_i x + q_i^T x \le 0$ where $Q_i \in \mathbb{R}^{N \times N}$, $q_i \in \mathbb{R}^N$, $\forall i \in [M]$, a Mixed Integer Quadratically Constrained Programming (MIQCP) is defined.

Moreover, if $f(x)$ and each $h_i(x)$ are convex functions of $x$, the underlying optimization problem falls into an important class of problems, termed convex optimization.

² Similarly, constraints in the form of $h_i(x) \ge 0$ can be rewritten as $-h_i(x) \le 0$, and equality constraints, i.e., $h_i(x) = 0$, can be converted equivalently to $h_i(x) \le 0$ and $-h_i(x) \le 0$. Constraints with


Convexity is highly desirable since every local minimum is a global minimum, and hence the problem can be solved to optimality through a number of off-the-shelf numerical approaches. If strict convexity holds, the global minimum is unique [34]. In practice, problems that do not satisfy the convexity condition are often relaxed to a neighboring convex counterpart, and the corresponding solution is then modified to be feasible for the original problem [55,56,59]. For example, neither MIQP nor MIQCP is convex, but a convex relaxation can be carried out by first solving the neighboring problem with the integer constraints neglected and then rounding the solution to the nearest integers. This convex relaxation does not guarantee optimality, though. By contrast, for MIQP or MIQCP, if the integer constraints correspond to taking values from a finite set of integers, there are algorithms, e.g., branch-and-bound [93], that are guaranteed to reach the optimal solution given a sufficient amount of computational time.
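The gap between relax-and-round and an exact method can be seen on a toy integer quadratic program. The sketch below is illustrative only: the objective, the integer box $\{0,1,2,3\}^2$, and all function names are hypothetical, and plain exhaustive enumeration is used as a brute-force stand-in for branch-and-bound (it shares the optimality guarantee on a finite set, not the pruning).

```python
from itertools import product

def objective(x: float, y: float) -> float:
    """Toy convex quadratic (hypothetical): a narrow valley along the line x = 2y."""
    return 100.0 * (x - 2.0 * y) ** 2 + (x + y - 2.1) ** 2

def relaxed_minimizer():
    """Minimizer with the integer constraints dropped: solve grad f = 0 by Cramer's rule."""
    # Stationarity conditions: 202x - 398y = 4.2 and -398x + 802y = 4.2
    a11, a12, a21, a22, b1, b2 = 202.0, -398.0, -398.0, 802.0, 4.2, 4.2
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det

x_rel, y_rel = relaxed_minimizer()           # continuous optimum, close to (1.4, 0.7)
rounded = (round(x_rel), round(y_rel))       # relax-and-round candidate

# Exhaustive search over the finite integer box {0,...,3}^2
best = min(product(range(4), repeat=2), key=lambda p: objective(*p))
```

In this instance the rounded point falls off the valley floor while the enumerated optimum stays on it, which is the kind of optimality loss the rounding heuristic can incur.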

2.2

Reinforcement Learning

Reinforcement Learning (RL) is a field in machine learning where an agent learns what to do, i.e., how to take actions in given situations, with the goal of maximizing expected cumulative reward³. The learning is carried out via the agent interacting with the environment, without any exact guidance on what to do; instead, the agent is given incremental reward signals in a trial-and-error fashion, which is very similar to how human beings, or any creatures, learn to behave [131].

Two of the key features of RL lie in the notion of expected cumulative reward. Expected accounts for the potential stochasticity embedded in the environment: the reward that the agent is able to accumulate is partially under its control through the actions taken and partially influenced by the stochastic environment. Cumulative indicates that the agent might have to take multiple actions sequentially over some horizon, where reward could be delayed, and therefore short-term reward should be sacrificed under certain circumstances in favor of long-term reward.

³ This is under the reward hypothesis that all goals can be described by the maximization of expected

RL is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

2.2.1

Markov Decision Process

Markov Decision Process (MDP) is typically used as the environment for RL [31]. An MDP consists of an agent⁴ and the environment, and is defined via a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$. $\mathcal{S}$ denotes the state space, the set containing all possible states $s$ of the environment; $\mathcal{A}$ denotes the action space, which consists of all possible actions $a$ that the agent can take; $\mathcal{P}$ denotes the state transition function capturing the dynamics of the MDP, which takes the state $s$ of the environment and the action $a$ taken by the agent and outputs the probability distribution of the next state, $s' \sim \mathcal{P}^a_{ss'}$; $\mathcal{R}$ denotes the reward function, which outputs the reward signal $r = \mathcal{R}^a_s$ when the agent takes the action $a$ at state $s$; and $\gamma$ denotes the discount factor, which on the one hand makes the formulation more tractable mathematically and on the other hand takes into consideration that future reward is typically less valued than current reward from a practical perspective. The state transition function $\mathcal{P}^a_{ss'}$ and the reward function $\mathcal{R}^a_s$ are together referred to as the model of the MDP, both of which could be stochastic. The state $s$ defined in an MDP is required to satisfy the Markov Property defined as follows.

Definition 1 (Markov Property). A state $s_k$ is Markov if and only if

$$\mathbb{P}[s_{k+1} \mid s_k] = \mathbb{P}[s_{k+1} \mid s_1, s_2, \cdots, s_k] \tag{2.3}$$

where $k \in \mathbb{Z}_{>0}$ denotes the time index. The Markov Property indicates that the future is independent of the past given the present.

A policy is defined as a mapping from state to action, which could be deterministic, described by a function $a = \pi(s)$, or stochastic, described by a conditional probability distribution $a \sim \pi(a \mid s)$. It can be proved that for any MDP, there always exists a deterministic policy that is optimal; therefore we do not consider stochastic policies in this thesis and use $\pi(\cdot)$ as the general notation for a (deterministic) policy. The objective of the agent is then to derive a policy such that the expected total discounted reward is optimized. Formally, the expected total discounted reward that can be accumulated starting from state $s_1$ following the policy $\pi(\cdot)$ is defined as follows.

$$R^{\pi}(s_1) = \mathbb{E}\left[\, \sum_{k=1}^{\infty} \gamma^{k} \mathcal{R}^{\pi(s_k)}_{s_k} \,\middle|\, s_1 \right] \tag{2.4}$$

The policy that maximizes $R^{\pi}(s)$ defined in Eq. (2.4), $\forall s \in \mathcal{S}$, is termed the optimal policy, and denoted as $\pi^*(\cdot)$.

The agent learns to derive the optimal policy through interacting with the environment in the following way: (i) the agent is at state $s$ and takes an action $a = \pi(s)$ following some behavioral policy $\pi(\cdot)$, which is typically not the optimal one and will be improved gradually; (ii) the agent then receives a reward $r$ from the environment following the reward function $r = \mathcal{R}^a_s$; (iii) the environment transits to a new state $s'$ following the state transition function $s' \sim \mathcal{P}^a_{ss'}$. Through this procedure, the agent collects more information regarding the environment and therefore learns to take better actions.

Now we define the value function of an MDP. There are two definitions of the value function: $v(\cdot)$, the value function taking a state as input, and $q(\cdot,\cdot)$, the action-value function taking a state-action pair as input. Value functions are defined as the expected total discounted reward that can be accumulated starting from a state or a state-action pair, respectively, following a policy. Therefore the policy $\pi(\cdot)$ that the agent follows is typically indicated in the superscripts. Formally,

$$v^{\pi}(s_k) = \mathbb{E}\left[\, \sum_{i=0}^{\infty} \gamma^{i} \mathcal{R}^{\pi(s_{k+i})}_{s_{k+i}} \,\middle|\, s_k \right] \tag{2.5}$$

$$q^{\pi}(s_k, a_k) = \mathbb{E}\left[\, \mathcal{R}^{a_k}_{s_k} + \sum_{i=1}^{\infty} \gamma^{i} \mathcal{R}^{\pi(s_{k+i})}_{s_{k+i}} \,\middle|\, s_k, a_k \right] \tag{2.6}$$


Following the optimal policy $\pi^*(\cdot)$, the agent realizes an expected total discounted reward at least as good as that of any other policy, starting from any state or state-action pair, that is

$$v^{\pi^*}(s) \ge v^{\pi}(s), \;\; \text{and} \;\; q^{\pi^*}(s, a) \ge q^{\pi}(s, a), \quad \forall s \in \mathcal{S},\, a \in \mathcal{A},\, \pi \in \Pi \tag{2.7}$$

where $\Pi$ denotes the space consisting of all possible policies. The corresponding $v^{\pi^*}(\cdot)$ and $q^{\pi^*}(\cdot,\cdot)$ are termed the optimal value functions, and typically denoted as $v^*(\cdot)$ and $q^*(\cdot,\cdot)$ respectively for ease of notation. The deterministic optimal policy can be found by maximizing over $q^*(\cdot,\cdot)$, i.e.,

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \; q^*(s, a), \quad \forall s \in \mathcal{S} \tag{2.8}$$

Or if the model of the MDP is available, $\pi^*(\cdot)$ can be derived through $v^*(\cdot)$ as

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \left[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v^*(s') \right] \tag{2.9}$$

where Eq. (2.9) follows from the Bellman Optimality Equation [26] provided in Eq. (2.10), which essentially leverages the optimal substructure of the MDP and breaks down the expected total discounted reward into two parts: one is the immediate reward at the current step, and the other corresponds to the ongoing reward summarized into the value of the possible next state.

$$v^*(s) = \max_{a \in \mathcal{A}} \left[ \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v^*(s') \right] \tag{2.10}$$

With the fundamentals of MDP introduced above, we now provide a brief overview of several important concepts that have been exploited in the development of this thesis.

Observability refers to the extent to which the agent knows the environment. If the agent has full knowledge of the state representation $s$ that satisfies the Markov property, then the MDP is fully observable. Otherwise the corresponding MDP is termed a Partially Observable Markov Decision Process (POMDP), where the observation of the agent is a proper subset of the environment state, which typically does not satisfy the Markov property [108,126]. A POMDP can be converted to a fully observable MDP via several approaches, e.g., deploying the history, i.e., a sequence of actions, observations, and rewards, as the state representation, or utilizing the belief state, a probability distribution over possible environment states conditioned on the history; however, more computational burden will be introduced [35,115].

Prediction and control refer to two problem categories in an MDP: the former corresponds to evaluating the future given a policy, while the latter is to find the optimal policy such that the future is optimized. Prediction could be a subproblem of control, e.g., in the control algorithm policy iteration, which iteratively improves the policy towards the optimal one, prediction is carried out to evaluate each intermediate policy to drive the direction of policy improvement. In some approaches, prediction is not a must, e.g., value iteration, which we will elaborate on in the next subsection.

Model-based or model-free classifies whether the MDP to be solved or the algorithm deployed leverages the model of the MDP, i.e., the state transition function $\mathcal{P}^a_{ss'}$ and the reward function $\mathcal{R}^a_s$. If the model is available, it is referred to as model-based; otherwise, if at least part of the model is not available or is too costly to exploit, it is model-free. Note that model-free does not necessarily mean that the model cannot be evaluated, but rather that full knowledge of it is absent or hard to leverage. Typical model-free approaches either learn to estimate the model first and then solve the MDP through model-based approaches or directly solve the MDP without explicitly learning the model. If $v(\cdot)$ is to be used, typically a model is needed according to Eq. (2.9), while $q(\cdot,\cdot)$ is suitable for model-free problems as indicated in Eq. (2.8).


Value- or policy-based approaches or actor-critic algorithms refer to a taxonomy of algorithms depending upon whether a value function is explicitly used. Value-based approaches learn the value function, $v(\cdot)$ or $q(\cdot,\cdot)$, and then derive the policy, while policy-based ones directly parametrize the policy without deploying the value function as an intermediate step [128,132]. Compared with value-based approaches, policy-based ones have better convergence properties, are more effective in high-dimensional or continuous action spaces and can learn stochastic policies, but typically converge to a local rather than a global optimum and could suffer from inefficiency and high variance when evaluating a policy. Actor-critic algorithms combine value- and policy-based approaches and achieve variance reduction by introducing a critic to estimate the value function, which is used by the actor to enhance the policy [64,89].

Exploration and exploitation is a trade-off that any RL algorithm should be able to balance so as to achieve good performance. The agent learns to take better actions through interacting with the environment. Exploration refers to collecting new information regarding the environment towards a better informed policy through this interaction, while the reward along the way might not be optimal, and exploitation refers to leveraging known information to maximize reward. Typically an agent explores more at first and then exploitation dominates. Exploration is as important as exploitation, since otherwise the agent may get stuck in local optima, and thereby the two should be balanced.
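One common way to realize this balance is $\epsilon$-greedy action selection with a decaying $\epsilon$. The sketch below is a generic illustration rather than the scheme used in this thesis; the function names, the linear decay schedule and its default parameters are all assumptions made for the example.

```python
import random
from typing import Sequence

def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon explore (uniformly random action); otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step: int, start: float = 1.0, end: float = 0.05,
                    decay_steps: int = 1000) -> float:
    """Hypothetical linear schedule: explore more at first, let exploitation dominate later."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

rng = random.Random(0)
q_estimates = [0.1, 0.9, 0.3]   # hypothetical action-value estimates at some state
action = epsilon_greedy(q_estimates, decayed_epsilon(step=500), rng)
```

Keeping the final $\epsilon$ strictly positive retains a small amount of exploration throughout, which is one simple guard against settling on a suboptimal action.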

2.2.2

Value Iteration

Value Iteration is a family of value-based algorithms that derive the optimal value function, and hence the optimal policy, of an MDP by carrying out Bellman optimality backups [53,134]. We explain this idea starting with an MDP with a discrete and finite state space $\mathcal{S} = \{s_1, s_2, \cdots, s_n\}$, where $n \in \mathbb{Z}_{>0}$ denotes the number of possible states in the state space⁵. Denote $v^* = [v^*(s_1), v^*(s_2), \cdots, v^*(s_n)] = [v^*(s_i)]_{i \in [n]} = [v^*_i]_{i \in [n]}$ as the vector containing the optimal value of all the states in the state space. To derive $v^*$, value iteration starts with an initial guess $v_0$ and iteratively carries out updates via the mapping $T$ defined as follows

$$(Tv)_i \triangleq \max_{a \in \mathcal{A}} \left[ \mathcal{R}^a_{s_i} + \gamma \sum_{s_i' \in \mathcal{S}} \mathcal{P}^a_{s_i s_i'} v(s_i') \right] \tag{2.11}$$

where $s_i'$ denotes a possible next state of $s_i$. In the $i$th iteration, the update then follows

$$v_i = T v_{i-1} \tag{2.12}$$

⁵ The subscripts index states in the state space, instead of corresponding to the time indexes used in other

In fact, $v^*$ is a fixed point of the mapping $T$ defined in Eq. (2.11). With a tabular representation and full backup, value iteration can be proved to converge to the unique optimal value function at a linear convergence rate of $\gamma$. The proof is carried out using the Contraction Mapping Theorem [107]. We first prove that $T$ is a $\gamma$-contraction.

$$\begin{aligned} \|Tv - Tv'\|_\infty &= \max_{i \in [n]} \left| (Tv)_i - (Tv')_i \right| \\ &\le \max_{i \in [n]} \max_{a \in \mathcal{A}} \gamma \left| \sum_{s_i' \in \mathcal{S}} \mathcal{P}^a_{s_i s_i'} \left[ v(s_i') - v'(s_i') \right] \right| \\ &\le \gamma \max_{s_i' \in \mathcal{S}} \left| v(s_i') - v'(s_i') \right| = \gamma \|v - v'\|_\infty \end{aligned} \tag{2.13}$$

where $v, v' \in \mathbb{R}^n$. Since $v$ and $v'$ are arbitrarily chosen, $T$ is a $\gamma$-contraction; and since $\mathbb{R}^n$ is complete, value iteration is guaranteed to converge to the unique fixed point, which is $v^*$, at a linear convergence rate of $\gamma$.
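The contraction argument can be checked numerically on a small tabular example. The sketch below applies the Bellman optimality operator of Eq. (2.11) with full backups to a hypothetical 2-state, 2-action MDP (the transition probabilities, rewards and $\gamma = 0.9$ are invented for illustration) and records the sup-norm distance between successive iterates.

```python
GAMMA = 0.9
# Hypothetical model: P[s][a] is the next-state distribution, R[s][a] the reward
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

def backup(v, s, a):
    """One-step lookahead: R^a_s + gamma * sum_s' P^a_ss' v(s')."""
    return R[s][a] + GAMMA * sum(p * v_next for p, v_next in zip(P[s][a], v))

def bellman_operator(v):
    """Apply T to every state (tabular representation, full backup)."""
    return [max(backup(v, s, a) for a in range(2)) for s in range(2)]

def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))

def value_iteration(tol=1e-10):
    """Iterate v <- Tv from the zero vector until successive iterates differ by less than tol."""
    v, gaps = [0.0, 0.0], []
    while True:
        v_new = bellman_operator(v)
        gaps.append(sup_norm(v_new, v))
        v = v_new
        if gaps[-1] < tol:
            return v, gaps

v_star, gaps = value_iteration()
# Greedy policy extracted from v* using the model, as in Eq. (2.9)
greedy = [max(range(2), key=lambda a: backup(v_star, s, a)) for s in range(2)]
```

Each entry of `gaps` is at most $\gamma$ times the previous one, which is the linear convergence rate established by Eq. (2.13).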

However, when the cardinality of the discrete and finite state space is large, or the state space is infinite or continuous, a tabular representation or full backup is not practical. This is because they are too costly for MDPs with a large state space in terms of both computational burden and memory requirement, and simply not feasible for those with an infinite or continuous state space. Instead, a representation with a function approximator and sample backup is used. The former uses a value function approximator $\tilde{v}(s; w)$ parameterized by weights $w$ to approximate the optimal value function $v^*(s)$, $\forall s \in \mathcal{S}$, where $|w| \ll |\mathcal{S}|$ to enable a compact representation, and the latter carries out backups using states from a proper subset of the state space, instead of the entire state space, to enhance computational efficiency. With the application of function approximation, the update is applied to the weights, instead of to the value function directly. This version is typically termed fitted value iteration, where the algorithm starts with the initial guess $w_0$ and applies Bellman optimality updates to improve the weights iteratively. In the $i$th iteration, the weights are updated from $w_{i-1}$ to $w_i$ as follows

$$w_i = \arg\min_{w} \sum_{s \in \tilde{\mathcal{S}}_i} \left\{ \tilde{v}(s; w) - \max_{a \in \mathcal{A}} \left[ \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \tilde{v}\left(s'; w_{i-1}\right) \right] \right\}^2 \tag{2.14}$$

where the loss function, defined as the mean squared error⁶ between the updated estimate and the bootstrapped target obtained through the Bellman Optimality Equation, is minimized over states from the mini-batch $\tilde{\mathcal{S}}_i \subset \mathcal{S}$. Typically, there is no performance guarantee for fitted value iteration, in terms of whether the algorithm converges, or, if it converges, the convergence rate and the quality of the solution. We have deployed fitted value iteration in our proposed RL algorithm and will address these issues in greater detail in the chapters that follow.
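The update in Eq. (2.14) can be illustrated with a deliberately simple approximator. In the sketch below, a linear model $\tilde{v}(s; w) = w_0 + w_1 s$ stands in for the neural network of this thesis, and the toy MDP, its rewards, $\gamma = 0.5$ and all names are hypothetical. The transitions are deterministic, so the sum over $s'$ collapses to a single term, and the toy problem is chosen so that its optimal value function, $v^*(s) = 2s$, is exactly representable; no such convenience should be expected in general.

```python
import random

GAMMA = 0.5
STATES = range(5)

def step(s, a):
    """Hypothetical deterministic toy MDP: action 0 stays (reward s), action 1 resets to state 0 (reward 0)."""
    return (float(s), s) if a == 0 else (0.0, 0)

def v_tilde(s, w):
    """Linear value function approximator with features (1, s)."""
    return w[0] + w[1] * s

def bellman_target(s, w_prev):
    """Bootstrapped target: max_a [R + gamma * v_tilde(s'; w_prev)]."""
    return max(r + GAMMA * v_tilde(s_next, w_prev)
               for r, s_next in (step(s, a) for a in (0, 1)))

def least_squares_fit(xs, ys):
    """Closed-form simple linear regression, i.e., the arg min of the squared loss over w."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return (my - slope * mx, slope)

def fitted_value_iteration(iterations=60, batch_size=3, seed=0):
    rng, w = random.Random(seed), (0.0, 0.0)
    for _ in range(iterations):
        batch = rng.sample(STATES, batch_size)  # mini-batch: a proper subset of the state space
        w = least_squares_fit(batch, [bellman_target(s, w) for s in batch])
    return w

w_final = fitted_value_iteration()  # approaches (0, 2), i.e., v*(s) = 2s
```

Replacing `least_squares_fit` with a few gradient steps on the same squared loss would give the form of the update typically used with neural network approximators.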

2.3

Neural Networks for Sequences

In this section, we provide a brief introduction to the neural network designs that specialize in dealing with sequences and have been exploited in this thesis work, including fully-connected layers, Long Short-Term Memory layers, with general recurrent neural networks as a prerequisite, and convolutional layers, as well as the approaches for learning the weights of the neural networks, including Adam, which is the optimization algorithm we have deployed in this thesis.


2.3.1

Feedforward Neural Networks

The feedforward neural networks are among the most basic classes of neural network mod-els, which typically consist of a number of neurons. Each of these neurons corresponds to a single scalar variable of the model, and are arranged into a layered configuration. In a feedforward architecture, the value of the neurons in one layer are only dependent on ones in the previous layer, therefore the information flows in a feedforward fashion. With the terminology from graph theory, a neuron is denoted to be connected to any other neurons in an adjacent layer if it either takes input from or injects output to them. The functional dependence for a single neuron has a universal form in feedforward neural networks as follows

$$z = a\left(w^\top x + b\right) \tag{2.15}$$

where z denotes the value of the neuron being evaluated, i.e., the output, and x represents the values of the neurons in the previous layer, structured as a vector, i.e., the input. w and b denote the weights and bias, respectively. a(·) is termed the activation function and is typically nonlinear by design; z is therefore often referred to as the activation of the neuron. Typical activation functions include the sigmoid function, tanh, and ReLU. Thus, according to Eq. (2.15), the value of the neuron is obtained from the previous layer through an affine map followed by a nonlinear transformation. By varying the values of the weights w and bias b, one can obtain only a somewhat limited range of functions mapping from the input x to the output z.
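The single-neuron map of Eq. (2.15) can be sketched in a few lines of Python. The function names `neuron`, `relu`, and `sigmoid` and the numerical values are hypothetical and chosen only for illustration.

```python
import math


def relu(u):
    return max(0.0, u)


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def neuron(x, w, b, activation=math.tanh):
    """Eq. (2.15): affine map w^T x + b followed by the activation a(.)."""
    pre_activation = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return activation(pre_activation)


x = [1.0, 2.0]                                   # values of the previous layer
z_relu = neuron(x, [1.0, 1.0], -1.0, relu)       # relu(1 + 2 - 1)
z_tanh = neuron(x, [0.5, -0.25], 0.0)            # tanh(0.5 - 0.5)
z_sigmoid = neuron(x, [0.5, -0.25], 0.0, sigmoid)
```

Changing only `w` and `b` shifts and scales the argument of the activation, which is why a single neuron covers only a limited family of functions.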

A complete feedforward neural network is obtained by stacking several such layers on top of each other, i.e., by successively applying function compositions, as illustrated in Fig. 2-1. As the numbers of layers and neurons increase, the expressive power of the resulting model increases significantly, making it capable of accurately approximating a broad range of smooth functions according to universal approximation theorems [48]. The layers between the first layer, i.e., the input layer, and the last layer, i.e., the output layer, are typically termed hidden layers. If more than one hidden layer is included, the corresponding neural network designs are referred to as deep neural networks, and otherwise as shallow neural networks. If each neuron in a specific layer is connected with each neuron in the layers both before and after (if any) this layer, the corresponding layer is fully-connected.

Figure 2-1. Illustration of a feedforward, deep, and fully-connected neural network design. $n_h \in \mathbb{Z}_{>0}$ denotes the number of hidden layers. The input and output layers and the first and last hidden layers are plotted. The neural network maps a 6-dimensional input x to a 3-dimensional output z.
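The stacking of fully-connected layers can be illustrated with a short forward pass that maps a 6-dimensional input to a 3-dimensional output, as in Fig. 2-1. This is a sketch under assumptions: the hidden widths, the random initialization range, the tanh hidden activations, the identity output activation, and the helper names `dense_layer`, `init_layer`, and `forward` are all hypothetical.

```python
import math
import random


def identity(u):
    return u


def dense_layer(x, W, b, activation):
    """Apply Eq. (2.15) to every neuron of a fully-connected layer."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(x, layers):
    """Function composition: the output of one layer is the input of the next."""
    for W, b, activation in layers:
        x = dense_layer(x, W, b, activation)
    return x


rng = random.Random(0)
sizes = [6, 8, 8, 3]  # input, two hidden layers, output (hypothetical widths)
layers = []
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    W, b = init_layer(n_in, n_out, rng)
    is_output = (i == len(sizes) - 2)
    layers.append((W, b, identity if is_output else math.tanh))

z = forward([0.1, -0.2, 0.3, 0.0, 0.5, -0.4], layers)  # 3-dimensional output
```

With two hidden layers, this network counts as deep in the terminology used above; dropping one of them would make it shallow.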

For any given input x and any fixed network structure, the model output z(x, θ) can be easily calculated through Eq. (2.15) and the quality of prediction on the given dataset can be measured by evaluating a cost function defined as follows

$$\mathcal{J}(\theta) = \sum_{n} \mathcal{L}\left[ z_n(x_n, \theta), y_n \right] \tag{2.16}$$

where $\mathcal{J}(\theta)$ denotes the cost function parameterized by θ, which represents the set of all model parameters including weights and biases. y denotes the desired ground truth corresponding to a prediction z from the input x, and $n \in \mathbb{Z}_{>0}$ indexes the datapoints in the dataset. The cost function $\mathcal{J}(\theta)$ sums the loss function $\mathcal{L}(\cdot,\cdot)$, which measures the difference between a single prediction-truth pair, over all datapoints in the dataset. This overall cost is minimized with respect to the parameters θ, which defines the model that will then be utilized to make predictions for unseen inputs. This minimization can be realized conveniently via a number of gradient-based approaches. For this purpose, the gradient of the cost function can typically be computed via the well-known backpropagation algorithm [120], which calculates the error and gradient first for the last layer and then moves layer by layer in a backward fashion, hence the name. This algorithm enables the use of any gradient- and mini-batch-based approaches that lead to the minimization of the overall cost. We provide a brief survey of some popular optimization algorithms, especially ones that have been deployed in this thesis, in Section 2.3.5.
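To make the cost of Eq. (2.16) and the backward pass concrete, the sketch below trains a deliberately tiny network with one tanh hidden neuron and a linear output using plain gradient descent. The squared-error loss, the toy dataset, the step size, and the names `forward`, `cost`, `gradient`, and `gradient_descent` are assumptions for illustration; the thesis itself uses Adam rather than plain gradient descent.

```python
import math


def forward(theta, x):
    """Scalar network: h = tanh(w1 * x + b1), z = w2 * h + b2."""
    w1, b1, w2, b2 = theta
    h = math.tanh(w1 * x + b1)
    return h, w2 * h + b2


def cost(theta, data):
    """Eq. (2.16) with an assumed squared-error loss for each datapoint."""
    return sum(0.5 * (forward(theta, x)[1] - y) ** 2 for x, y in data)


def gradient(theta, data):
    """Backpropagation: error at the output first, then pushed back through tanh."""
    w1, b1, w2, b2 = theta
    g = [0.0, 0.0, 0.0, 0.0]
    for x, y in data:
        h, z = forward(theta, x)
        dz = z - y                      # derivative of the loss w.r.t. the output z
        du = dz * w2 * (1.0 - h * h)    # derivative w.r.t. the hidden pre-activation
        g[0] += du * x                  # d cost / d w1
        g[1] += du                      # d cost / d b1
        g[2] += dz * h                  # d cost / d w2
        g[3] += dz                      # d cost / d b2
    return g


def gradient_descent(theta, data, lr=0.02, n_steps=500):
    for _ in range(n_steps):
        g = gradient(theta, data)
        theta = [t - lr * g_i for t, g_i in zip(theta, g)]
    return theta


# Toy dataset: samples of sin(x) on [-1, 1] (hypothetical).
data = [(k / 4.0, math.sin(k / 4.0)) for k in range(-4, 5)]
theta0 = [0.5, 0.0, 0.5, 0.0]
theta = gradient_descent(theta0, data)
```

A common sanity check for a hand-written backward pass is to compare `gradient` against central finite differences of `cost`, one parameter at a time.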

2.3.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of deep neural network architectures that can be viewed as nonlinear dynamical systems; they take input in the form of sequences and output either sequential or non-sequential data. The most distinctive feature of an RNN is the presence of hidden states whose values depend on those of the previous temporal steps and the current input. These units represent the internal memory of the model and are designed to address the sequential nature of the input, i.e., the same set of inputs arranged in a different order produces different results. The output sequence is determined by the hidden state at each step. A non-sequential output can also be realized by simply taking the output of the last RNN unit. Fig. 2-2 shows the unfolded structure of an RNN model.

The model is parameterized by weight matrices $\{W_{hx}, W_{hh}, W_{oh}\}$ and bias vectors $\{b_h, b_o\}$, which are once again jointly denoted by θ. Given an input sequence $(x_1, x_2, \cdots, x_T)$, where $T \in \mathbb{Z}_{>0}$ denotes the horizon, the RNN computes a sequence of hidden states $(h_1, h_2, \cdots, h_T)$ and a sequence of outputs $(z_1, z_2, \cdots, z_T)$ as follows

$$h_t = \sigma_h\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h\right) \tag{2.17}$$

$$z_t = \sigma_z\left(W_{oh} h_t + b_o\right) \tag{2.18}$$
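The recursion in Eqs. (2.17) and (2.18) can be sketched as a simple loop over the input sequence. The initial hidden state `h0`, the default activations, the scalar weights in the usage lines, and the names `matvec` and `rnn_forward` are assumptions for illustration.

```python
import math


def identity(u):
    return u


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def rnn_forward(xs, W_hx, W_hh, W_oh, b_h, b_o, h0,
                sigma_h=math.tanh, sigma_z=identity):
    """Return the hidden states (h_1..h_T) and outputs (z_1..z_T)."""
    h = h0  # assumed initial hidden state
    hs, zs = [], []
    for x in xs:
        # Eq. (2.17): new hidden state from current input and previous hidden state.
        pre_h = [p + q + c for p, q, c in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [sigma_h(u) for u in pre_h]
        # Eq. (2.18): output computed from the current hidden state.
        z = [sigma_z(u + c) for u, c in zip(matvec(W_oh, h), b_o)]
        hs.append(h)
        zs.append(z)
    return hs, zs


# Scalar example: the same inputs in a different order give a different last output.
params = dict(W_hx=[[1.0]], W_hh=[[0.5]], W_oh=[[1.0]], b_h=[0.0], b_o=[0.0], h0=[0.0])
_, zs_ab = rnn_forward([[1.0], [0.0]], **params)
_, zs_ba = rnn_forward([[0.0], [1.0]], **params)
order_matters = zs_ab[-1] != zs_ba[-1]
```

Taking `zs[-1]` alone corresponds to the non-sequential output described above, while the full list `zs` is the sequential output.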
