Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder: DeepBreath

Publisher’s version / Version de l'éditeur:

Expert Systems with Applications, 156, 2020-04-18

NRC Publications Archive Record / Notice des Archives des publications du CNRC :

https://nrc-publications.canada.ca/eng/view/object/?id=aca48a15-8e8f-4f94-942a-e498d5e7fe7d

https://publications-cnrc.canada.ca/fra/voir/objet/?id=aca48a15-8e8f-4f94-942a-e498d5e7fe7d

This publication could be one of several versions: author’s original, accepted manuscript or the publisher’s version.

For the publisher’s version, please access the DOI link below.

https://doi.org/10.1016/j.eswa.2020.113456


Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder: DeepBreath

Farzan Soleymani, Eric Paquet

National Research Council, 1200 Montreal Road, Ottawa, ON K1K 2E1, Canada

Article history: Received 3 October 2019 Revised 10 February 2020 Accepted 13 April 2020 Available online 18 April 2020

Keywords:

Portfolio management; Deep reinforcement learning; Restricted stacked autoencoder; Online learning; Settlement risk; Blockchain

Abstract

The process of continuously reallocating funds into financial assets, aiming to increase the expected return on investment and minimize the risk, is known as portfolio management. In this paper, a portfolio management framework is developed based on a deep reinforcement learning framework called DeepBreath. The DeepBreath methodology combines a restricted stacked autoencoder and a convolutional neural network (CNN) into an integrated framework. The restricted stacked autoencoder is employed in order to conduct dimensionality reduction and feature selection, thus ensuring that only the most informative abstract features are retained. The CNN is used to learn and enforce the investment policy, which consists of reallocating the various assets in order to increase the expected return on investment. The framework consists of both offline and online learning strategies: the former is required to train the CNN while the latter handles concept drifts, i.e. changes in the data distribution resulting from unforeseen circumstances. These are based on passive concept drift detection and online stochastic batching. Settlement risk may occur when, because of the delay between the acquisition of an asset and its payment, one party fails to deliver on the terms of a contract. In order to tackle this challenging issue, a blockchain is employed. Finally, the performance of the DeepBreath framework is tested with four test sets over three distinct investment periods. The results show that the return on investment achieved by our approach outperforms current expert investment strategies while minimizing the market risk.

Crown Copyright © 2020 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

A collection of financial assets, known as a portfolio, may consist of financial instruments such as stocks, bonds, currencies and cryptocurrencies. The process of reallocating the various assets forming the portfolio, in order to increase the expected return on investment, is called portfolio management. The financial instruments managed in such a portfolio may be assimilated to time series, which often exhibit complex behaviors due to external factors such as the global economy and the political climate. As a result, their behaviors are inherently nonlinear, uncertain and non-stationary (Malkiel, 2003; Tsai & Hsiao, 2010).

Generally, the erratic and complex behavior of the financial market is determined by endogenous and exogenous factors (Filimonov & Sornette, 2012). In recent years, due to the constant rise of social media such as Twitter™ and Facebook™, it has been implicitly assumed that market evolution essentially results from these exogenous factors. This assumption finds its origin in the efficient market hypothesis (Fama, 1991; Malkiel & Fama, 1970), in which the price of an asset fully reflects all available information. Nevertheless, various studies have demonstrated the falsehood of this hypothesis, showing that these exogenous factors have a limited impact on markets whose behavior is essentially driven by endogenous factors (Bouchaud, 2011; Cutler, Poterba, & Summers, 1988; Joulin, Lefevre, Grunberg, & Bouchaud, 2008). The extent to which the market is endogenous is determined by the spectral radius associated with the underlying state-dependent Hawkes process (Filimonov & Sornette, 2012).

Corresponding author. E-mail addresses: Farzan.Soleymani@nrc-cnrc.gc.ca (F. Soleymani), Eric.Paquet@nrc-cnrc.gc.ca (E. Paquet).

When managing a portfolio, one must take into account the volatility and the correlation associated with the various financial instruments. Portfolio management is an investment strategy that aims at maximizing the expected return on investment (ROI) while minimizing the financial risk by continuously reallocating the portfolio assets, i.e. by attributing the proper weight to each financial instrument. With recent advances in machine learning and deep learning, it has become possible to predict complex financial behaviors and to automate the decision process, at least up to a certain extent (Jangmin, Lee, Lee, & Zhang, 2006).

Stock market behavior prediction with an artificial neural network (ANN) was proposed for the first time by Atsalakis and Valavanis (2009). Artificial neural networks are data-driven machine learning techniques which do not require a complete prior model of the data generation mechanism. Nonetheless, traditional multilayer neural networks do not take advantage of the time series structure associated with financial data (Eakins & Stansell, 2003; Hussain, Knowles, Lisboa, & El-Deredy, 2008; Lam, 2004).

Moreover, the performance of an ANN is strongly dependent on the network architecture, the learning and hyper-parameters, as well as the training set, just to mention a few. Therefore, other alternatives have been explored in order to improve the accuracy, such as the support vector machine (SVM), random forest (RF) and naïve Bayes, with seminal work by Patel, Shah, Thakkar, and Kotecha (2015), albeit with non-significant improvement. A major breakthrough came with the introduction of deep reinforcement learning (Jiang, Xu, & Liang, 2017), in which the policy, which determines the actions taken by the agent, is either learned with a convolutional neural network (CNN) or with some kind of recurrent neural network (RNN) such as the long short-term memory network (LSTM). In the portfolio management context, the actions are the weights attributed to the various assets forming the portfolio.

Financial data are particularly suited for deep learning because of the very large amount of data available for training (Goodfellow, Bengio, & Courville, 2016). Nevertheless, the performance of deep learning algorithms is affected by multiple factors such as the activation and pooling functions, noise, the network structure, which may consist of multiple sub-networks, as well as the feature selection mechanism. In the financial context, noise reduction has been achieved with wavelet transforms (Bao, Yue, & Rao, 2017), while more informative representations and dimensionality reduction were obtained with Boltzmann machines and autoencoders (Hinton & Salakhutdinov, 2006), among other techniques. For large portfolios, dimensionality reduction is desirable in order to optimize the learning process while avoiding the curse of dimensionality (Sorzano, Vargas, & Montano, 2014). Various types of encoders have been employed, such as convolutional autoencoders (Cheng, Sun, Takeuchi, & Katto, 2018) and stacked autoencoders (Bao et al., 2017), just to mention a few.

In this paper, we propose an online framework for portfolio management based on deep reinforcement learning and a restricted stacked autoencoder for feature selection. High-level features are extracted from the input data, which contain eleven (11) correlated features. The restricted stacked autoencoder extracts new informative features while reducing the dimensions down to three (3). Therefore, the training may be completed very rapidly, making this approach suitable for online learning and online portfolio management. The investment policy is enforced using a CNN. The algorithm is initially trained offline, which results in a model pre-trained from historical data. Then, the weights are updated following an online learning scheme in order to follow the evolution of the market. Both offline and online strategies are employed for learning: the former is based on passive concept drift detection (Gama, Indrė, Bifet, Pechenizkiy, & Bouchachia, 2014) and the latter on online stochastic batching. Passive concept drift detection allows the system to learn newly emerging patterns in the market while preserving the knowledge acquired during offline training with historical data. In order to facilitate the audit and to secure the settlement process, a blockchain is employed.

The paper is organized as follows. The underlying mathematical formalism associated with portfolio management is introduced in Section 2. Features normalization and selection are addressed in Section 3. Our deep reinforcement learning framework, named DeepBreath, is presented in Section 4. This framework integrates passive concept drift detection and online stochastic batching. This is followed, in Section 5, by a description of the blockchain employed for settlement. Our experimental results are presented in Section 6. A conclusion follows in Section 7.

Fig. 1. Weight reallocation process. The portfolio weight vector $w_{t-1}$ and the portfolio value $P_{t-1}$ at the beginning of period $t$ evolve to $w'_t$ and $P'_t$ respectively. At that moment, assets are sold and bought in order to increase the expected return on investment. These operations involve commission fees which shrink the portfolio value to $P_t$, which results in new weights $w_t$.

2. Mathematical model for portfolio management

The trading period for financial instruments on the stock exchange is one calendar day. During this period, the price fluctuations associated with a financial instrument may be characterized by four metrics, namely the price at the opening of the market, the lowest and the highest prices reached during the day and the price at the closure of the market. Based on these characteristics, this section provides a mathematical framework for portfolio management. An approach similar to Jiang et al. (2017), which was introduced earlier by Ormos and Urbán (2013), is followed.

The price vector $V_t$ consists of the closing prices of all the assets forming the portfolio at time $t$, in addition to the amount of cash available for trading (here in USD). The normalized price vector $Y_t$ is defined as the element-wise division between the price vectors at times $t$ and $t-1$:

$$Y_t := V_t \oslash V_{t-1} = \left[ 1, \frac{V_{1,t}}{V_{1,t-1}}, \frac{V_{2,t}}{V_{2,t-1}}, \ldots, \frac{V_{m,t}}{V_{m,t-1}} \right]^T. \tag{1}$$

Initially, the portfolio consists solely of cash, which means that the weight associated with the cash is one (1) while the weights associated with the other financial instruments are zero (0). In other words, no prior assumption is made about the portfolio composition: the weight allocation is entirely performed by our deep reinforcement learning algorithm:

$$w_0 = [1, 0, \ldots, 0]^T. \tag{2}$$
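Eqs. (1) and (2) can be sketched directly; a minimal illustration in which the asset prices are hypothetical and the leading entry of $Y_t$ is the cash asset:

```python
import numpy as np

def normalized_price_vector(v_t, v_prev):
    """Eq. (1): element-wise price ratios; the leading 1 is the cash asset."""
    return np.concatenate(([1.0], np.asarray(v_t) / np.asarray(v_prev)))

def initial_weights(m):
    """Eq. (2): the portfolio of m assets starts as 100% cash."""
    w0 = np.zeros(m + 1)
    w0[0] = 1.0
    return w0

v_prev = [100.0, 50.0, 20.0]   # closing prices at t-1 (hypothetical)
v_t    = [101.0, 49.0, 22.0]   # closing prices at t (hypothetical)
y_t = normalized_price_vector(v_t, v_prev)   # [1, 1.01, 0.98, 1.1]
w_0 = initial_weights(3)                     # [1, 0, 0, 0]
```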

In order to buy or to sell an asset, one must pay a commission fee. As illustrated in Fig. 1, owing to price fluctuations, at the end of period $t$ the portfolio weight vector evolves to

$$w'_t = \frac{Y_t \odot w_{t-1}}{Y_t \cdot w_{t-1}}, \tag{3}$$

where $w_{t-1}$ is the portfolio weight vector at the beginning of period $t$. At that moment, the portfolio is reallocated by buying and selling assets in order to increase the expected return on investment; the portfolio weight vector therefore evolves from $w'_t$ to $w_t$. The portfolio reallocation involves commission fees which shrink the portfolio value by a factor $\mu_t \in (0, 1]$:

$$P_t = \mu_t P'_t. \tag{4}$$

The portfolio value at the end of period $t$ is given by

$$P_t = \mu_t P_{t-1}\, Y_t \cdot w_{t-1}. \tag{5}$$

The return rate for period $t$ and the logarithmic return rate are defined as

$$\rho_t := \frac{P_t}{P_{t-1}} - 1 = \frac{\mu_t P'_t}{P_{t-1}} - 1 = \mu_t\, Y_t \cdot w_{t-1} - 1, \tag{6a}$$


Table 1. Financial indicators employed as features.

• Average True Range: an indicator that measures market volatility by decomposing the entire range of an asset price for that period.
• Commodity Channel Index: a momentum-based oscillator used to help determine when an investment vehicle is reaching a condition of being overbought or oversold.
• Commodity Selection Index: a momentum indicator that attempts to identify which commodities are the most suitable for short-term trading.
• Demand Index: an indicator that uses price and volume to assess the buying and selling pressure affecting a security.
• Dynamic Momentum Index: a technical indicator used to determine if an asset is overbought or oversold.
• Exponential Moving Average: a type of moving average that places a greater weight and significance on the most recent data points.
• Hull Moving Average: a moving average designed to be more responsive to current price activity whilst maintaining curve smoothness.
• Momentum: the rate of acceleration of a security's price or volume, that is, the speed at which the price is changing.

$$R_t := \ln\frac{P_t}{P_{t-1}} = \ln\!\left(\mu_t\, Y_t \cdot w_{t-1}\right). \tag{6b}$$

Therefore, the final value of the portfolio for a time horizon $t_f$ is given by

$$P_f = P_0 \exp\!\left( \sum_{t=1}^{t_f+1} R_t \right) = P_0 \prod_{t=1}^{t_f+1} \mu_t\, Y_t \cdot w_{t-1}. \tag{7}$$
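The single-period relations (3), (5) and (6b) can be sketched as follows; $\mu_t$ is treated here as a given constant, since its fixed-point computation (Eq. (10)) is a separate step, and all numbers are hypothetical:

```python
import numpy as np

def evolved_weights(y_t, w_prev):
    """Eq. (3): price moves alone shift the weights to w'_t."""
    y, w = np.asarray(y_t), np.asarray(w_prev)
    return (y * w) / np.dot(y, w)

def portfolio_value(p_prev, mu_t, y_t, w_prev):
    """Eq. (5): value at the end of period t, after commissions."""
    return mu_t * p_prev * np.dot(y_t, w_prev)

def log_return(mu_t, y_t, w_prev):
    """Eq. (6b): logarithmic return rate for period t."""
    return np.log(mu_t * np.dot(y_t, w_prev))

y = [1.0, 1.1, 0.9]        # normalized prices (cash first)
w = [0.2, 0.5, 0.3]        # weights at the beginning of the period
w_prime = evolved_weights(y, w)          # still sums to one
p_t = portfolio_value(100.0, 0.99, y, w) # 0.99 * 100 * 1.02 = 100.98
```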

Consequently, the total amount of cash gained $C_g$ by selling assets is

$$C_g = (1 - c_s)\, P'_t \sum_{i=1}^{m} \mathrm{ReLU}\!\left(w'_{t,i} - \mu_t w_{t,i}\right), \tag{8}$$

where the selling commission rate is denoted by $0 < c_s < 1$. As a result, the cash reserve $P'_t w'_{t,0}$ becomes $\mu_t P'_t w_{t,0}$, which corresponds to the new amount of cash available for acquiring new assets. On the other hand, balancing the cash available against the amount spent on new assets gives

$$(1 - c_b)\left[ w'_{t,0} + (1 - c_s)\sum_{i=1}^{m} \mathrm{ReLU}\!\left(w'_{t,i} - \mu_t w_{t,i}\right) - \mu_t w_{t,0} \right] = \sum_{i=1}^{m} \mathrm{ReLU}\!\left(\mu_t w_{t,i} - w'_{t,i}\right), \tag{9}$$

where the buying commission rate is given by $0 < c_b < 1$. With the help of $w'_{t,0} + \sum_{i=1}^{m} w'_{t,i} = 1 = w_{t,0} + \sum_{i=1}^{m} w_{t,i}$ and $\mathrm{ReLU}(x - y) - \mathrm{ReLU}(y - x) = x - y$, Eq. (9) yields

$$\mu_t = \frac{1}{1 - c_b w_{t,0}} \left[ 1 - c_b w'_{t,0} - (c_s + c_b - c_s c_b) \sum_{i=1}^{m} \mathrm{ReLU}\!\left(w'_{t,i} - \mu_t w_{t,i}\right) \right]. \tag{10}$$

The transaction factor $\mu_t$ is a function of the relative price vector $Y_t$ as well as of the current and previous weight vectors:

$$\mu_t = \mu_t\!\left(w_{t-1}, w_t, Y_t\right). \tag{11}$$

This implicit equation cannot be solved analytically. However, an approximation of the shrinking factor may be obtained through an iterative process. It has been demonstrated (Jiang et al., 2017) that the sequence generated by Eq. (10) always converges to the right solution.

Theorem 1. Given an initial estimate $\mu_0$, the sequence $\tilde{\mu}_t^{(k)}$ defined by

$$\left\{ \tilde{\mu}_t^{(k)} \;\middle|\; \tilde{\mu}_t^{(0)} = \mu_0 \;\text{and}\; \tilde{\mu}_t^{(k)} = f\!\left(\tilde{\mu}_t^{(k-1)}\right),\; k \in \mathbb{N} \right\}, \tag{12}$$

$$f(\mu) := \frac{1}{1 - c_b w_{t,0}} \left[ 1 - c_b w'_{t,0} - (c_s + c_b - c_s c_b) \sum_{i=1}^{m} \mathrm{ReLU}\!\left(w'_{t,i} - \mu w_{t,i}\right) \right], \tag{13}$$

always converges to the solution of Eq. (10) for any $\mu_0 \in [0, 1]$.

The convergence rate depends on the initial value of the transaction factor $\mu_t$. According to Moody, Wu, Liao, and Saffell (1998), a fast convergence rate may be obtained if the initial transaction factor $\mu_0$ is chosen as

$$\mu_0 = c \sum_{i=1}^{m} \left| w'_{t,i} - w_{t,i} \right|. \tag{14}$$

In the present work, the following assumptions are made:

• it is always possible to buy or to sell an asset at any time;
• the transactions do not affect the values of the financial instruments involved. In other words, the number of assets traded by the agent is small compared to the overall market liquidity.

In order to optimize the learning process, data normalization and feature selection are addressed in the next section.
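The fixed-point iteration of Theorem 1 is easy to sketch; a minimal version, assuming hypothetical commission rates and weight vectors (index 0 is the cash asset), using Eq. (14) as the initial estimate:

```python
import numpy as np

def transaction_factor(w_prime, w, c_b=0.0025, c_s=0.0025, n_iter=20):
    """Iterate f from Eq. (13) until the transaction factor mu_t converges."""
    relu = lambda x: np.maximum(x, 0.0)
    # Eq. (14): initial estimate proportional to the total weight turnover
    mu = c_b * np.sum(np.abs(w_prime[1:] - w[1:]))
    for _ in range(n_iter):                              # Eq. (12)
        mu = (1.0 / (1.0 - c_b * w[0])) * (
            1.0 - c_b * w_prime[0]
            - (c_s + c_b - c_s * c_b) * np.sum(relu(w_prime[1:] - mu * w[1:]))
        )                                                # Eq. (13)
    return mu

w_prime = np.array([0.1, 0.5, 0.4])   # weights after price moves, w'_t
w_new   = np.array([0.2, 0.4, 0.4])   # target weights, w_t
mu_t = transaction_factor(w_prime, w_new)   # slightly below 1
```

Since the commission rates are a few basis points, the map contracts very quickly and a handful of iterations suffices in practice.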

3. Features normalization and selection

As mentioned earlier, each financial instrument is characterized by four prices: the opening price, the lowest and the highest prices reached during the trading period, as well as the price at the closing of the markets. In addition, financial instruments may be described by a plethora of financial indicators such as the average true range, the exponential moving average and the commodity channel index, among others (Murphy, 1999): each index characterizes a specific behavior of the underlying asset. The financial indicators employed in our framework are described in Table 1; they are among the most commonly employed in the financial industry (Achelis, 2001).

Features normalization is addressed in the next subsection.

3.1. Features normalization

The portfolio is characterized by twelve (12) feature vectors. Each feature vector refers to a particular feature for all the assets forming the portfolio. The first four vectors contain the opening, low, high and closing prices of all the assets respectively. The last eight vectors contain the financial indicators for all the assets as described in Table 1. In order for the agent to efficiently learn the policy, these features must be normalized (Ross, Mineiro, & Langford, 2013). Because both the extrema and the effective range are unknown a priori, it is not possible to normalize them with the min-max approach. Therefore, the low, high and closing prices are normalized with respect to the opening price:

$$V_t^{(Lo)} = \left[ \frac{Lo(t-n-1)}{Op(t-n-1)}, \ldots, \frac{Lo(t-1)}{Op(t-1)} \right]^T, \quad V_t^{(Cl)} = \left[ \frac{Cl(t-n-1)}{Op(t-n-1)}, \ldots, \frac{Cl(t-1)}{Op(t-1)} \right]^T, \quad V_t^{(Hi)} = \left[ \frac{Hi(t-n-1)}{Op(t-n-1)}, \ldots, \frac{Hi(t-1)}{Op(t-1)} \right]^T. \tag{15}$$

Such a normalization is more informative than the min-max normalization scheme (Hussain et al., 2008) or the Z-score, as it clearly identifies the market trend (up and down). A similar approach is followed for the financial indicators, which are normalized with respect to their previous trading period:

$$V_t^{(FI)} = \left[ \frac{FI(t-n)}{FI(t-n-1)}, \ldots, \frac{FI(t)}{FI(t-1)} \right]^T. \tag{16}$$

As a result, the feature tensor consists of three (3) price vectors as defined in Eq. (15), as well as of eight (8) financial indicator vectors as described by Eq. (16).
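The two normalization schemes (15) and (16) can be sketched for a single asset; the price series below are hypothetical:

```python
import numpy as np

def normalize_prices(prices, opening):
    """Eq. (15): e.g. Lo(tau)/Op(tau), same-day opening price, over the window."""
    return np.asarray(prices) / np.asarray(opening)

def normalize_indicator(fi):
    """Eq. (16): FI(tau)/FI(tau-1); the result is one element shorter."""
    fi = np.asarray(fi, dtype=float)
    return fi[1:] / fi[:-1]

lows     = [9.0, 10.0]    # daily lows over the window (hypothetical)
openings = [10.0, 10.0]   # matching opening prices
v_lo = normalize_prices(lows, openings)      # [0.9, 1.0] -> down then flat
v_fi = normalize_indicator([2.0, 4.0, 8.0])  # [2.0, 2.0] -> rising indicator
```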

Policies are more difficult to learn when the feature tensors have a large number of dimensions, which in turn has a detrimental effect on the system performance (Choung, Lee, & Jeong, 2017). Such a situation may occur when the portfolio consists of a large number of assets; a commonplace situation in real life. In addition, some of these features may be highly correlated, which may result in sub-optimal learning; not to mention that the training process may become computationally prohibitive. In order to alleviate these problems, a smaller number of non-correlated and more informative features must be extracted from the original ones. The next section describes our feature selection approach, which is based on an unsupervised stacked restricted autoencoder.

3.2. Features selection

Features selection is performed with an unsupervised stacked restricted autoencoder inspired by Wong and Luo (2018). The input feature vector consists of eleven (11) features, namely the normalized low, high and closing prices as well as the eight normalized financial indicators, all taken at the same time $t$. The network is designed in order to extract low-dimensional uncorrelated features while optimizing the training process. Autoencoders are a type of unsupervised feedforward neural network that reconstructs the output from the input. An autoencoder is formed of two parts, namely the encoder and the decoder. The encoder reduces the dimensionality of the input data in the so-called latent or hidden layer, while the decoder reconstructs the input data from the hidden layer; such a network is said to be under-complete because of the restriction imposed by the latent layer (Bengio, Goodfellow, & Courville, 2015). Autoencoders are capable of learning nonlinear dimension reduction functions, removing redundancies and correlations while extracting highly informative features (Bengio et al., 2015).

As mentioned earlier, we employed a stacked restricted autoencoder. This network consists of seven (7) layers: one input layer, five (5) hidden layers and one output layer. The network structure is illustrated in Fig. 2. The network is asymmetrical: the first hidden layer has ten (10) neurons and the second nine (9) neurons, while the next three consist of three (3) neurons each. The feature extraction is performed at the level of the third hidden layer. The output layer also has only three (3) neurons. The network is said to be restricted because the decoder performs a partial reconstruction of the input vector. Indeed, the three neurons of the output vector refer to the normalized low, high and closing prices. Therefore, the financial indicators are not reconstructed by the decoder. Consequently, our encoder is not only under-complete but also constrained, the constraint being the partial reconstruction at the level of the decoder. As later demonstrated by the experimental results, the partial reconstruction constraint generates more informative feature vectors while putting the emphasis on what was foreseen as the most important features.
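The layer sizes above can be checked with a shape-only sketch. The weights below are random and untrained, and the tanh activations are an assumption, since the paper does not restate them here; only the 11 → 10 → 9 → 3 → 3 → 3 → 3 dimensions are taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [11, 10, 9, 3, 3, 3, 3]   # input, five hidden layers, restricted output
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Return the 3-d latent code (third hidden layer) and the 3-d output."""
    activations = []
    for w in weights:
        x = np.tanh(w @ x)
        activations.append(x)
    return activations[2], activations[-1]

latent, output = forward(rng.standard_normal(11))
# latent.shape == (3,): the abstract features fed to the CNN
# output.shape == (3,): only low/high/close are reconstructed
```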

Our deep reinforcement learning algorithm is described in the next section.

4. Deep reinforcement learning

A deep reinforcement learning algorithm (Ivanov & D'yakonov, 2019) consists of an agent that takes the best actions in the environment, given the state of the latter, in order to maximize a reward $r$. Each action taken by the agent is associated with a reward. The best action is determined by a policy which is learned with deep learning techniques. The agent seeks to maximize the reward by taking proper actions. In our framework, the action consists in determining the optimal weight vector for the portfolio in order to increase the expected return on investment:

$$a_t = w_t. \tag{17}$$

Due to the relation between the return rate, Eq. (6a), and the transaction factor, Eq. (11), the current action is partially determined by the previous one. Therefore, the state vector consists of the feature tensor at time $t$ as well as the weight vector at time $t-1$ (the state of the portfolio at this time):

$$s_t = [X_t, w_{t-1}]^T. \tag{18}$$

The portfolio weight vector $w_{t-1}$ and the feature tensor $X_t$ correspond to the internal (or latent) and external states respectively. As previously mentioned, the primary objective is to increase the portfolio return on investment, which is determined by Eq. (7) over an investment horizon $t_f + 1$. It follows that the reward is associated with the logarithmic accumulated return $R$, which may be inferred from Eqs. (6b), (7) and (11):

$$R\!\left(s_1, a_1, \ldots, s_t, a_t, s_{t+1}\right) = \frac{1}{t_f} \ln\frac{P_f}{P_0} = \frac{1}{t_f} \sum_{t=1}^{t_f+1} \ln\!\left(\mu_t\, Y_t \cdot w_{t-1}\right) = \frac{1}{t_f} \sum_{t=1}^{t_f+1} R_t. \tag{19}$$
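Eq. (19) reduces to averaging the per-period log returns; a minimal sketch, with hypothetical $\mu_t$, $Y_t$ and $w_{t-1}$ sequences (`ws[t]` holds the weights carried into period $t$):

```python
import numpy as np

def average_log_return(mus, ys, ws):
    """Eq. (19): (1/t_f) * sum of R_t over the horizon (needs >= 2 periods)."""
    r = [np.log(mu * np.dot(y, w)) for mu, y, w in zip(mus, ys, ws)]
    t_f = len(r) - 1   # the sum runs over t_f + 1 periods
    return sum(r) / t_f

# One period gains 10% on the single risky asset, the other is flat:
reward = average_log_return(
    mus=[1.0, 1.0],
    ys=[[1.1, 1.0], [1.0, 1.0]],
    ws=[[1.0, 0.0], [1.0, 0.0]],
)   # ln(1.1) / 1
```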

The algorithm employed to learn the policy is described in the next subsection.

4.1. SARSA

Portfolio management is considered a decision-making process in which the agent constantly takes actions in order to increase the portfolio value. The SARSA algorithm (Sutton & Barto, 2018) is employed to train a CNN in order to learn the investment policy. This network is described in the next subsection. SARSA is an on-policy algorithm, which means that the previous actions taken by the agent are taken into account in the learning process. This is in contrast with the Q-learning algorithm, in which the optimal policy is learned irrespective of the actions taken by the agent. Since on-policy methods are simpler and converge faster than their off-policy counterparts, they are more adapted to portfolio management (Sutton & Barto, 2018).

Our implementation of SARSA is described in Algorithm 1. The normalized feature tensor is fed to the stacked restricted autoencoder for feature selection and dimensionality reduction. These new features are employed to train the convolutional neural network, i.e. to determine the weights associated with the kernels. The state tensor and the action vector are initialized based on the input tensor as well as on the previous and current weight vectors. The initial weight vector is given by Eq. (2) and consists only of cash,

Algorithm 1 SARSA (ε-greedy).

Data: α ∈ [0, 1]: learning rate
Data: ε ∈ [0, 1]: epsilon value
Data: β_d = 1 × 10⁻²: epsilon decay rate
Data: N_t, B_n, r: number of iterations, batch number, immediate reward
Data: π_θ, Q: policy, state-action value
Data: w̃_t: selected action at (t + 1)
Data: A, S: action space, state space

1   Initialization: iteration = 0; initial ε ← 0.6 ≤ ε ≤ 0.85
    while iteration < N_t do
2       if random_number < ε then
3           w̃_t ← random_action ∈ A
4       else
5           w̃_t ← π_θ(X_{t+1}, w_t)
6       end
7       δ ← r_{t+1} + γ · Q(X_{t+1}, w_t, w̃_t)
        for (X_{t+1}, w_t, w̃_t) ∈ S × A do
8           Q(X_t, w_{t-1}, w_t) ← Q(X_t, w_{t-1}, w_t) + α · (δ − Q(X_t, w_{t-1}, w_t))
9       end
10      Update: [X_{t+1}, w_t]^T ← [X_t, w_{t-1}]^T; w_{t+1} ← w̃_t
        Update: ε ← ε_min + (ε_max − ε_min) · exp(−β_d · B_n)
11  end
12  Return: Q

which means that the CNN is entirely responsible for the reallocation of the assets. The discount rate $\gamma$ determines the trade-off between the immediate (or myopic) reward and the long-term strategy. The Q value corresponds to the average logarithmic accumulated return. The Q value is defined as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right), \tag{20}$$

where $r_{t+1}$ is the immediate reward. Therefore, following a transition from state $s_t$ to $s_{t+1}$, the Q value must be updated according to Eq. (20). In other words, the Q value is continuously estimated based on the current policy $\pi_\theta$, while the latter is repeatedly optimized in order to increase the Q value. The action and the state are given by Eqs. (17) and (18) respectively. Each action is taken according to an ε-greedy strategy: a random number is generated from a uniform distribution; if this number is smaller than a given threshold, a random action is taken; if not, the action is determined by the current policy. This strategy improves the exploration of the action space, which in turn results in an optimal solution while avoiding over-fitting and addiction (Garcia & Fernández, 2012). As such, this approach is reminiscent of simulated annealing (Van Laarhoven & Aarts, 1987). The ε value is initially chosen close to one and is progressively reduced toward zero as the policy is learned. Thereafter, the policy is updated according to Eq. (20), which is essentially a gradient descent algorithm (Sutton, Barto et al., 1998).
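The update rule (20) and the ε schedule of Algorithm 1 can be sketched in tabular form; a minimal sketch with toy hashable states and hypothetical ε bounds (the paper approximates Q with the CNN rather than a table, and the sign of the decay exponent is assumed negative):

```python
import math
from collections import defaultdict

def epsilon(batch_number, eps_min=0.05, eps_max=0.85, beta_d=1e-2):
    """Decay schedule from Algorithm 1; the bounds here are hypothetical."""
    return eps_min + (eps_max - eps_min) * math.exp(-beta_d * batch_number)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Eq. (20): Q <- Q + alpha * (r + gamma * Q(s', a') - Q(s, a))."""
    delta = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (delta - Q[(s, a)])
    return Q

Q = defaultdict(float)                     # unseen pairs default to 0
sarsa_update(Q, "s0", "a0", 1.0, "s1", "a1")
# Q[("s0", "a0")] is now 0.1: one step of size alpha toward the target
```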

The implementation of the gradient descent algorithm is described in the next section.

4.2. Online stochastic batching

Fig. 3. Online learning batches generation.

The gradient descent algorithm is implemented with an online stochastic batching approach in order to obtain an unbiased and optimal estimate of the gradient (Jiang et al., 2017). Financial data are sequential time series. Therefore, training batches must follow the same time ordering as the original data. In addition, different starting points generate completely distinctive batches. As a result, distinct mini-batches may be obtained by simply changing their starting point. For instance, $[t_B, t_B + B_s)$ and $[t_B + 1, t_B + B_s + 1)$ correspond to two distinct batches. In this paper, the online stochastic method, as proposed by Jiang et al. (2017), is employed in order to generate mini-batches. In this approach, it is assumed that the starting points are distributed according to a geometric distribution:

$$P_\beta(t_B) = \beta\,(1 - \beta)^{\,t - t_B - B_s}, \tag{21}$$

where the decaying rate is given by $\beta \in (0, 1)$ and where the relation between the starting point and the mini-batch size is given by $B_s$. This distribution tends to favor more recent events, the extent of which is controlled by the decaying rate. In order to train the CNN, both offline learning and online learning are utilized.
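Sampling a starting point with the geometric decay of Eq. (21) can be sketched as follows; the values of $\beta$, the batch size and the horizon are hypothetical:

```python
import random

def sample_batch_start(t, batch_size, beta=0.05, rng=random.Random(0)):
    """Draw t_B so the window [t_B, t_B + B_s) ends before the current time t.

    The number of steps back from the most recent admissible start is
    geometrically distributed, so recent windows are favoured; a larger
    beta concentrates the sampling on the most recent data.
    """
    latest = t - batch_size          # most recent admissible starting point
    k = 0                            # k ~ Geometric(beta)
    while rng.random() > beta:
        k += 1
    return max(0, latest - k)

starts = [sample_batch_start(1000, 50) for _ in range(200)]
# most draws cluster near the most recent admissible start, 950
```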

Our learning strategy is described in the next subsection.

4.3. Offline and online learning

The CNN learns the policy offline, which means that the policy is learned from historical data. Historical data are any data anterior to the current calendar time. The offline learning technique is based on online stochastic batching, which has been described in the previous subsection. Nevertheless, the underlying distribution associated with a financial time series may vary over time. This phenomenon is called concept drift (Gama et al., 2014). In the case of financial instruments, concept drift is usually triggered by unforeseen exogenous factors such as political turmoil, unexpected loss or panic reaction (Cavalcante & Oliveira, 2015). These factors may affect the efficacy of the actions taken by the agent. Essentially, concept drift may be handled in two different ways, which are known as the active and the passive approach (Rodríguez & Kuncheva, 2008). In the first case, a concept drift detector attempts to detect the occurrence of a concept drift. This is usually achieved by determining if there is a noteworthy change in the underlying data distribution (Elwell & Polikar, 2009). Once a concept drift is detected, the agent is trained afresh from the subsequent events (Gama, Medas, Castillo, & Rodrigues, 2004). In the second approach, the agent is initially trained from historical data and then continuously updated as new data become available.

In our framework, we employ the passive approach, which bestows us with feature drift adaptation (Gama et al., 2014). This is mainly motivated by two reasons: firstly, in trading, behaviors or events that occurred in the past are likely to repeat themselves in the future (Hu et al., 2015). By contrast, the active approach does not take advantage of this prior knowledge that exists latently in the historical data. Secondly, if the market is unstable, concept drift may occur repeatedly, and the agent might tend to employ short-term investment strategies which may have a detrimental effect on long-term investments (Hu et al., 2015). Nonetheless, passive concept drift detection tends to forget patterns that do not repeat themselves after a certain amount of time (Gama et al., 2004). These so-called anomalies have a relatively small impact on long-term investment strategies.

Online learning is performed for both the restricted stacked autoencoder and the CNN. The batches for online learning are generated as outlined in Fig. 3. A buffer is designed to store the last ten (10) days of offline data along with one time step (one business day) of streaming data. At the end of each day, the offline dataset is expanded by adding the current data point to the offline dataset.
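The buffer just described can be sketched with a bounded deque; the integer "days" below are hypothetical stand-ins for the real feature tensors:

```python
from collections import deque

offline = list(range(100))                 # hypothetical historical days 0..99
buffer = deque(offline[-10:], maxlen=11)   # ten offline days + room for one more

def on_new_day(point):
    """End-of-day step: the streaming point joins both the window and the offline set."""
    buffer.append(point)                   # 10 offline days + 1 streaming day
    offline.append(point)                  # the offline dataset is expanded daily
    return list(buffer)

window = on_new_day(100)   # 11 entries: days 90..100
```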

In the next subsection, the convolutional neural network responsible for learning the policy is described.

4.4. Convolutional neural network

In our framework, the policy is enforced with a convolutional neural network, which is illustrated in Fig. 4. Indeed, convolutional neural networks have demonstrated their ability to learn and implement complex policies efficiently in a deep reinforcement learning framework (Jiang et al., 2017). The policy is initially learned offline from historical data with SARSA in conjunction with online stochastic batching. Once the offline training is completed, the policy is continuously updated online as new information becomes available.

The architecture ofour CNN isdescribed in Algorithm2. The

Algorithm 2 Implementation of the convolutional neural network.

Data: f, m, n: number of features, number of assets, trading window length
Data: V1, V2, V3: extracted features
Initialization: input X_t = [V1, V2, V3]^T
while trading do
    1. First convolution: kernel size (1, 3); kernel depth f; Tanh activation
    2. Second convolution: kernel size (1, n); kernel depth f; Tanh activation
    3. Third convolution: concatenate the weights for the previous and current time steps into the weight matrix W = [W1, W2, ..., Wi]^T; kernel size (1, 1); kernel depth f + 1; Tanh activation
    4. Adding cash: concatenate the cash bias with the computed weight matrix
    5. Softmax layer
    Return: distributed portfolio weights ← w_t = [w_C, w_1, w_2, ..., w_m]^T
end

The CNN is fed with the features learned by the restricted stacked autoencoder. These features form a tensor consisting of three (3) channels. Each channel forms a matrix which is associated with a particular abstract feature. Each matrix contains the values of the abstract feature for the assets (rows) forming the portfolio over the last n days (columns). Then, a convolution is performed on each channel with a single kernel of size 1 × 3, followed by a Tanh activation function. Once more, a convolution with Tanh activation is performed on each of the three channels obtained from the previous convolution, with a single kernel of size 1 × n, in order to obtain three vectors of size m. Thereafter, a fourth channel is added which consists of the weights (action) obtained from the previous iteration. A third and last convolution with activation is performed with a 1 × 1 kernel (whose depth is equal to the number of channels) in order to obtain a single vector of size m. The last two steps constitute an implementation of Eq. (18), which states that the current action is determined by the current state as well as by the previous action (weights). Finally, this vector is normalized with a softmax function in order to obtain the new portfolio weights, i.e. the current action.
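The final normalization (cash bias plus softmax, steps 4 and 5 of Algorithm 2) can be sketched as follows; the score values and the cash bias are illustrative assumptions:

```python
import numpy as np

def to_portfolio_weights(asset_scores, cash_bias=0.0):
    """Prepend the cash bias to the per-asset scores produced by the 1x1
    convolution, then apply a softmax so the portfolio weights are
    non-negative and sum to one (a sketch, not the paper's exact code)."""
    scores = np.concatenate(([cash_bias], asset_scores))
    scores = scores - scores.max()   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()               # [w_C, w_1, ..., w_m]

w = to_portfolio_weights(np.array([0.5, -1.2, 2.0, 0.1]))
print(w.shape)  # (5,): cash plus four assets
```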

The settlement process is addressed in the next section.

5. Settlement with blockchain

So far, we have implicitly assumed that stock exchange transactions may be finalized in real time. Unfortunately, this is not the case. Indeed, although a sale or a purchase may be performed in a fraction of a second, a transaction settlement may take days, even weeks (Iansiti & Lakhani, 2017). This is at the origin of a major problem called settlement risk, which results from a failure, either on the part of the seller or on the part of the buyer, to deliver a sold asset or to pay for it.

Fig. 5. Distributed network associated with the settlement blockchain.

Generally speaking, a specialized third party, the so-called Central Securities Depository, coordinates the legal ownership of securities versus payment (Schmiedel, Malkamäki, & Tarkka, 2006). In addition, intermediaries such as brokers, custodians and payment agents are involved in the process (Pacces, 2000). Consequently, a typical settlement process is both time- and cost-consuming (Benos, Garratt, & Gurrola-Perez, 2017). In order to address this problem, we have implemented a settlement system on a Bitcoin blockchain (Nakamoto, 2008). It is assumed that the assets may be exchanged on a digital market, the extent of which is determined by legal considerations rather than by technical ones. The cash and assets associated with the portfolio are stored in a digital wallet (Dwyer, 2015).

Each time the reinforcement learning algorithm sells or buys an asset, a transaction is initiated. Each transaction corresponds to a block in the blockchain. A transaction consists of two primary components: the input object, which contains the information about the original sender (timestamp, balance, signature, sender's public key), and the output object, which contains the nature and the amount of the transaction as well as the address of the receiver. The transaction is not validated by a third party but by a distributed consensus mechanism known as proof-of-work (Nakamoto, 2008).

In this consensus mechanism, all the nodes forming the network, such as the one illustrated in Fig. 5, compete to solve a cryptographic problem which is computationally prohibitive to solve but whose solution is easily validated. The nodes, or miners, may only solve this problem on a trial-and-error basis, which means that it is almost impossible to obtain the solution rapidly. Consequently, it is virtually impossible to predict which miner shall validate the transaction.

The cryptographic problem consists of finding a nonce such that the hash (Sobti & Geetha, 2012) associated with the new block is lower than a given threshold, the target hash. The lower the target, the more difficult the problem becomes. The miner that first solves the problem adds the newly validated block to the blockchain and receives a reward, in terms of either digital or cryptocurrencies, as an incentive. The validation consists of determining
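A minimal sketch of this mechanism, assuming a simplified difficulty expressed as a number of leading zero hexadecimal digits (the actual Bitcoin target encoding is more involved):

```python
import hashlib

# Toy proof-of-work (a sketch, not the paper's settlement system): find a
# nonce such that the SHA-256 hash of the block data plus the nonce starts
# with `difficulty` zero hex digits, i.e. falls below a target threshold.

def mine(block_data: bytes, difficulty: int = 4) -> int:
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + str(nonce).encode()).hexdigest()
        if digest.startswith(target):   # hash below the target threshold
            return nonce
        nonce += 1                      # trial and error: no shortcut exists

def verify(block_data: bytes, nonce: int, difficulty: int = 4) -> bool:
    # Validation is cheap: a single hash evaluation.
    digest = hashlib.sha256(block_data + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = mine(b"tx: sell 10 AAPL", difficulty=4)
print(verify(b"tx: sell 10 AAPL", nonce))  # True
```

Solving requires many hash evaluations on average, while verification requires exactly one; this asymmetry is what makes the winning miner unpredictable yet easy to audit.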


Table 2
Composition of our portfolio for the period between January 2, 2002 and July 31, 2018.

Sector                    Name
Technology                Apple Inc. (AAPL), Adobe Inc. (ADBE), Cisco Systems, Inc. (CSCO), Intel Corporation (INTC), International Business Machines Corporation (IBM), Microsoft Corporation (MSFT), NVIDIA Corporation (NVDA), Oracle Corporation (ORCL)
Financial Services        Bank of America Corporation (BAC), HSBC Holdings plc (HSBC), Royal Bank of Canada (RY), The Toronto-Dominion Bank (TD)
Consumer Cyclical         Amazon.com, Inc. (AMZN), The Walt Disney Company (DIS), The Home Depot, Inc. (HD)
Consumer Defensive        Walmart Inc. (WMT), The Coca-Cola Company (KO)
Industrials               Boeing Co (BA)
Healthcare                Johnson & Johnson (JNJ), Merck & Co., Inc. (MRK), Novo Nordisk A/S (NVO), Novartis AG (NVS)
Communication Services    Verizon Communications Inc. (VZ)

if both the money and the asset are available and then, to finalize the transaction. As a result, the settlement problem is solved without any recourse to an intermediary (Wall & Malm, 2016). In addition, such a distributed system may withstand a cyber-attack in which 50% or less of its nodes are overtaken (Sayeed & Marco-Gisbert, 2019).

It is important to point out that the blockchain is not an essential part of our framework, in the sense that it does not affect the deep reinforcement learning investment strategy. The role of the blockchain is to mitigate the settlement risk, a risk that is also present in traditional trading. Furthermore, our approach is not limited to digital assets, as it may be applied to any financial instrument traded on a stock exchange.

Our experimental results are presented in the next section.

6. Experiment and results

The composition of our portfolio, in terms of stocks, appears in Table 2. The time series starts on January 2, 2002, and ends on July 31, 2018.

In order to evaluate the performance of our system in terms of return on investment, two metrics were employed: namely the Sharpe ratio and the maximum drawdown (MDD). These metrics are among the most commonly employed in portfolio performance evaluation (Bessler, Opfer, & Wolff, 2017). The Sharpe ratio was introduced by Sharpe (1994) and is essentially a ratio between the

Table 4
Training and test sets.

ID      Training set                Test set
Test 1  2002-01-02 to 2008-08-15    2009-06-15 to 2009-10-21
Test 2  2002-01-02 to 2011-12-06    2012-10-03 to 2013-02-14
Test 3  2002-01-02 to 2015-04-06    2016-02-01 to 2016-06-09
Test 4  2002-01-02 to 2017-05-19    2018-03-19 to 2018-07-27

Table 5
Maximum drawdown (MDD) and Sharpe ratio (S_r) as obtained with our system for the four test sets.

ID          Investment duration    MDD (%)    S_r
Test Set 1  30 days                1.71       −1.82
            60 days                4.87       1.57
            90 days                7.49       2.44
Test Set 2  30 days                2.43       −1.014
            60 days                6.91       2.22
            90 days                8.23       2.20
Test Set 3  30 days                4.62       2.9
            60 days                9.44       2.34
            90 days                17.46      3.47
Test Set 4  30 days                3.13       3.13
            60 days                12.52      3.09
            90 days                12.52      2.26

Table 6
Return on investment for our approach and for the Dow Jones Industrial (Bloomberg, 2019).

ID          Investment duration    RoI (%) algorithm    RoI (%) DJI
Test Set 1  30 days                0.24                 5.62
            60 days                3.67                 10.85
            90 days                5.74                 15.52
Test Set 2  30 days                0.5                  −6.85
            60 days                4.83                 −4.12
            90 days                6.91                 3.88
Test Set 3  30 days                2.43                 4.88
            60 days                7.4                  9.68
            90 days                18.09                9.33
Test Set 4  30 days                2.83                 −2.07
            60 days                11.14                2.4
            90 days                11.93                3.72

expected return on investment and the risk:

S_r = E[R_f − R_b] / σ_d,  (22)

where the numerator is the expectation of the differential return and σ_d is the standard deviation associated with the asset excess return (the volatility); the higher the Sharpe ratio, the better. It is generally accepted that a Sharpe ratio below one is suboptimal; a ratio between one and two is acceptable, while a ratio between two and three is considered very good. Any ratio over three is regarded as excellent (Maverick, 2019). The maximum drawdown metric was introduced by Magdon-Ismail and

Table 3
Hyper-parameters.

Hyper-parameter          Size     Description
Portfolio Size           23       Number of assets considered to create the portfolio
Trading Window           30       Number of trading days over which the agent is trained to increase the portfolio value (investment return)
Trading Period           1 day    Length of time in which price movements occur
Time Interval            4172     Number of days from 02/01/2002 to 27/07/2018
Offline Learning Rate    10^−4    Parameter α of the Adam optimizer (Kingma & Ba, 2015) for offline training
Online Learning Rate     10^−1    Learning rate for the online training session
Sample Bias              10^−5    Parameter β of the geometric distribution used when selecting online training sample batches (Eq. (21) of Section 4.2)
Commission Rate          1.5%     Rate of the commission fee applied to each transaction
Risk-Free Rate           1%       Theoretical investment return rate with zero risk
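A sketch of how the sample bias might enter online stochastic batching, assuming (as in Jiang et al., 2017) that batch starting points are drawn with probabilities that decay geometrically with their distance from the present; the dataset length, batch length and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_batch_start(n_points, batch_len, beta=1e-5):
    """Draw the starting index of a training batch. The probability of
    starting k steps back from the present decays as (1 - beta)^k, so
    recent market data is sampled slightly more often (a sketch of
    Eq. (21), not the paper's exact implementation)."""
    k = np.arange(n_points - batch_len)        # steps back from the present
    probs = (1.0 - beta) ** k
    probs /= probs.sum()                       # normalize to a distribution
    steps_back = rng.choice(len(k), p=probs)
    return n_points - batch_len - steps_back   # index of the batch start

start = sample_batch_start(n_points=4172, batch_len=30)
print(0 <= start <= 4172 - 30)  # True
```

With β = 10^−5, the distribution is nearly uniform over the 4172-day history; a larger β would concentrate the batches on recent data.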


Fig. 6. Evolution of the portfolio weights during the first 11 days of training for the online learning process.

Fig. 7. Final portfolio weights distributions for the first training set.

Atiya (2004). This ratio is a function of the maximum value P and the minimum value L reached by the portfolio throughout a certain period of time; performant portfolios are characterized by a low MDD:

MDD = (P − L) / P.  (23)
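Both metrics can be computed directly from Eqs. (22) and (23); the return series and portfolio values below are illustrative:

```python
import numpy as np

def sharpe_ratio(returns, benchmark=0.0):
    # S_r = E[R_f - R_b] / sigma_d (Eq. 22): mean excess return over its
    # standard deviation; higher is better.
    excess = np.asarray(returns) - benchmark
    return excess.mean() / excess.std()

def max_drawdown(values):
    # MDD = (P - L) / P (Eq. 23), with P the running peak and L the lowest
    # value reached after that peak; lower is better.
    values = np.asarray(values, dtype=float)
    peaks = np.maximum.accumulate(values)      # running maximum P
    return np.max((peaks - values) / peaks)    # worst peak-to-trough loss

values = np.array([100.0, 110.0, 95.0, 120.0, 105.0])
print(round(max_drawdown(values), 4))  # 0.1364: the 110 -> 95 drop
```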

In order to validate our approach, four training sets were generated. As mentioned earlier, the assets forming the portfolio are described in Table 2. Each training set covers a different period or window: all periods are initiated on January 2, 2002, while ending on August 15, 2008; December 6, 2011; April 6, 2015; and May 19, 2017, respectively. To these four training sets correspond four test sets of 90 days each, starting the day following the end of their respective training sets. The training sets and their corresponding test sets are described in Table 4. Initially, the portfolio consists solely of cash. For each training set, the agent learns the weights associated with the assets in order to increase the expected return on investment. The hyper-parameters associated with the agent appear in Table 3.

The final weights distributions for the first training set, during the training episodes, are shown in Fig. 7. The final weights distributions of the second training set, for the four training episodes, are shown in Fig. 10. As may be noticed from the figures, the final distribution is affected by the duration of the training period. Indeed, longer periods encompass states or events that do not occur


Fig. 8. Final portfolio weights distributions for the first test set over a period of 30, 60 and 90 days.

Fig. 9. Relative return on investment for the first testing set over a period of 30, 60 and 90 days.

in shorter periods, which in turn affects the decisions taken by the agent. As illustrated by Fig. 6, weights fluctuate during the online learning process. Some weights are subjected to important variations, such as those of Apple and Amazon, while others, such as those of HSBC and IBM, are not. This behavior may be explained by the fact that the assets associated with the most stable weights are utilized in order to mitigate the risk (portfolio hedging), while the most volatile are employed for their leverage effect, which explains the high variability of their weights.

Fig. 8 shows the final weights distributions, for the first test set, after thirty, sixty and ninety days respectively. Although sometimes subtle, the corrections nevertheless have a large impact on the expected return on investment. Indeed, as illustrated by Fig. 9, the relative return on investment, with respect to the initial amount of cash, steadily increases over this ninety-day period, with a relative gain of almost 7%. Despite some fluctuations, there is a continued upward trend.

Likewise, Figs. 10–18 show the final training weights distributions for the last three training sets, the final weights distributions for the last three test sets, as well as their respective relative returns on investment. In each set, a clear upward trend for the expected return on investment may be observed. The longer the training period, the more discontinuous is the evolution of the expected return on investment. The origin of this behavior is to be found in the fact that a longer training period allows the agent to learn more complex actions, which in turn result in more drastic decisions. Nonetheless, the process remains very efficient, as the final relative returns on investment are approximately 7%, 18% and 11% for the last three test sets respectively.

Fig. 10. Final portfolio weights distributions for the second training set.

Fig. 11. Final portfolio weights distributions for the second test set.

Our results are summarized in Table 5. From the table, it should be noticed that the Sharpe ratios associated with our approach are either very good or excellent for nine out of the twelve test cases; the test sets themselves are described in Table 4. Test set 1 has the shortest training window while Test set 4 has the longest. Therefore, Table 5 clearly demonstrates that the performances of our system are improved if the CNN is trained over a large historical dataset such as that of Test set 4. In other words, a large amount of historical data is required in order to learn the policy efficiently. Indeed, for Test set 4, the Sharpe ratio is excellent for a time horizon of thirty and sixty days, while it is very good for a long-term investment made over ninety days. These results are corroborated by the maximum drawdown metric.

In order to further ascertain the performances of our system, we compare the expected return on investment of our approach with that associated with the Dow Jones Industrial (DJI), the latter being considered a benchmark in the finance industry (Zhang, Wang, Li, & Shen, 2018). We restrict our discussion to Test set 4, as we have established that the agent must be trained over a large historical dataset in order to properly learn the policy. Our results are reported in Table 6. With our approach, the returns on investment (RoI) are 2.83%, 11.14% and 11.93% over periods of 30, 60 and 90 days respectively, while the RoI, for the DJI, is −2.07% (a loss!), 2.4% and 3.72% over the same


Fig. 12. Relative return on investment for the second testing set over a period of 30, 60 and 90 days.


Fig. 14. Final portfolio weights distributions for the third test set.


Fig. 16. Final portfolio weights distributions for the fourth training set.


Fig. 18. Relative return on investment for the fourth testing set over a period of 30, 60 and 90 days.

time horizons. These results clearly demonstrate the effectiveness of our system.

7. Conclusion

This paper proposed a framework for portfolio management and optimization based on deep reinforcement learning, called DeepBreath. A portfolio contains multiple assets that are traded simultaneously to automatically increase the expected return on investment while minimizing the risk. The investment policy is implemented using a CNN; as a result, financial instruments are reallocated by buying and selling shares on the stock market. The neural network, as well as its hyper-parameters, is common to all the assets, which means that the computational complexity only increases linearly with the number of financial instruments. In order to reduce the computational complexity associated with the size of the training dataset, a restricted stacked autoencoder was employed in order to obtain non-correlated and highly informative features. The investment policy was learned with the SARSA algorithm, both offline and online, using online stochastic batching and passive online learning respectively. The passive online learning approach handled concept drift resulting from exogenous factors. The settlement risk problem was mitigated with a blockchain. Our experimental results demonstrated the efficiency of our system. In future work, exogenous factors, such as social media, will be integrated into the framework in order to take into account their effect on price fluctuations. In addition, the theoretical model will be improved in order to take into account transactions in which a large volume of assets is traded, i.e. when the actions (buying or selling) of the agent may have a large impact on the environment (large variations in prices and market liquidity). Finally, we plan to explore deep reinforcement learning architectures which are more suitable for long-term investment strategies.

Declaration of Competing Interest

All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.

The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Credit authorship contribution statement

Farzan Soleymani: Conceptualization, Methodology, Writing - original draft, Software, Visualization, Validation. Eric Paquet: Supervision, Writing - review & editing, Data curation.

References

Achelis, S. B. (2001). Technical analysis from A to Z. New York: McGraw-Hill.
Atsalakis, G. S., & Valavanis, K. P. (2009). Surveying stock market forecasting techniques – part II: Soft computing methods. Expert Systems with Applications, 36(3), 5932–5941.

Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS One, 12(7), e0180944.

Bengio, Y., Goodfellow, I. J., & Courville, A. (2015). Deep learning. Nature, 521(7553), 436–444.

Benos, E., Garratt, R., & Gurrola-Perez, P. (2017). The economics of distributed ledger technology for securities settlement. Available at SSRN 3023779.

Bessler, W., Opfer, H., & Wolff, D. (2017). Multi-asset portfolio optimization and out-of-sample performance: An evaluation of Black–Litterman, mean-variance, and naïve diversification approaches. The European Journal of Finance, 23(1), 1–30.

Bloomberg (2019). Dow Jones Industrial (DJI), stock quote. https://www.bloomberg. com/quote/INDU:IND . Accessed: 2019-04-15.

Bouchaud, J.-P. (2011). The endogenous dynamics of markets: Price impact, feedback loops and instabilities. Lessons from the credit crisis. Risk Publications.
Cavalcante, R. C., & Oliveira, A. L. I. (2015). An approach to handle concept drift in financial time series based on extreme learning machines and explicit drift detection. In 2015 International joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

Cheng, Z., Sun, H., Takeuchi, M., & Katto, J. (2018). Deep convolutional autoencoder-based lossy image compression. In 2018 Picture coding symposium (PCS) (pp. 253–257). IEEE.

Choung, O.-h. , Lee, S. W. , & Jeong, Y. (2017). Exploring feature dimensions to learn a new policy in an uninformed reinforcement learning task. Scientific Reports, 7 (1), 17676 .

Cutler, D. M., Poterba, J. M., & Summers, L. H. (1988). What moves stock prices? National Bureau of Economic Research.


Dwyer, G. P. (2015). The economics of bitcoin and similar private digital currencies. Journal of Financial Stability, 17 , 81–91 .

Eakins, S. G. , & Stansell, S. R. (2003). Can value-based stock selection criteria yield superior risk-adjusted returns: An application of neural networks. International Review of Financial Analysis, 12 (1), 83–97 .

Elwell, R., & Polikar, R. (2009). Incremental learning of variable rate concept drift. In International workshop on multiple classifier systems (pp. 142–151). Springer.
Fama, E. F. (1991). Efficient capital markets: II. The Journal of Finance, 46(5), 1575–1617.

Filimonov, V. , & Sornette, D. (2012). Quantifying reflexivity in financial markets: To- ward a prediction of flash crashes. Physical Review E, 85 (5), 056108 .

Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. In Brazilian symposium on artificial intelligence (pp. 286–295). Springer.
Gama, J., Žliobaitè, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.

Garcia, J., & Fernández, F. (2012). Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45, 515–564.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hu, Y., Liu, K., Zhang, X., Xie, K., Chen, W., Zeng, Y., & Liu, M. (2015). Concept drift mining of portfolio selection factors in stock market. Electronic Commerce Research and Applications, 14(6), 444–455.

Hussain, A. J., Knowles, A., Lisboa, P. J. G., & El-Deredy, W. (2008). Financial time series prediction using polynomial pipelined neural networks. Expert Systems with Applications, 35(3), 1186–1199.

Iansiti, M. , & Lakhani, K. R. (2017). The truth about blockchain. Harvard Business Review, 95 (1), 118–127 .

Ivanov, S., & D'yakonov, A. (2019). Modern deep reinforcement learning algorithms. arXiv preprint arXiv:1906.10025.

Jangmin, O. , Lee, J. , Lee, J. W. , & Zhang, B.-T. (2006). Adaptive stock trading with dynamic asset allocation using reinforcement learning. Information Sciences, 176 (15), 2121–2147 .

Jiang, Z., Xu, D., & Liang, J. (2017). A deep reinforcement learning framework for the financial portfolio management problem. arXiv: 1706.10059 .

Joulin, A., Lefevre, A., Grunberg, D., & Bouchaud, J.-P. (2008). Stock price jumps: News and volume play a minor role. arXiv:0803.1769.

Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. arXiv:1412.6980.

Lam, M. (2004). Neural network techniques for financial performance prediction: Integrating fundamental and technical analysis. Decision Support Systems, 37 (4), 567–581 .

Magdon-Ismail, M. , & Atiya, A. F. (2004). Maximum drawdown. Risk Magazine, 17 (10), 99–102 .

Malkiel, B. G. (2003). The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17(1), 59–82.

Malkiel, B. G. , & Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25 (2), 383–417 .

Maverick, J. B. (2019). What is a good sharpe ratio? https://www.investopedia.com/ ask/answers/010815/what- good- sharpe- ratio.asp .

Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5–6), 441–470.

Murphy, J. J. (1999). Technical analysis of the financial markets: A comprehensive guide to trading methods and applications . Penguin .

Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf

Ormos, M. , & Urbán, A. (2013). Performance analysis of log-optimal portfolio strate- gies with transaction costs. Quantitative Finance, 13 (10), 1587–1597 .

Pacces, A. M. (2000). Financial intermediation in the securities markets: Law and economics of conduct of business regulation. International Review of Law and Economics, 20(4), 479–510.

Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1), 259–268.

Rodríguez, J. J., & Kuncheva, L. I. (2008). Combining online classification approaches for changing environments. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR) (pp. 520–529). Springer.

Ross, S., Mineiro, P., & Langford, J. (2013). Normalized online learning arXiv: 1305. 6646 .

Sayeed, S. , & Marco-Gisbert, H. (2019). Assessing blockchain consensus and security mechanisms against the 51% attack. Applied Sciences, 9 (9), 1788 .

Schmiedel, H., Malkamäki, M., & Tarkka, J. (2006). Economies of scale and technological development in securities depository and settlement systems. Journal of Banking and Finance, 30(6), 1783–1806.

Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21(1), 49–58.
Sobti, R., & Geetha, G. (2012). Cryptographic hash functions: A review. International Journal of Computer Science Issues (IJCSI), 9(2), 461.

Sorzano, C. O. S., Vargas, J., & Montano, A. P. (2014). A survey of dimensionality reduction techniques. arXiv:1403.2877.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Sutton, R. S., Barto, A. G., et al. (1998). Introduction to reinforcement learning: vol. 2. Cambridge: MIT Press.

Tsai, C.-F., & Hsiao, Y.-C. (2010). Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches. Decision Support Systems, 50(1), 258–269.

Van Laarhoven, P. J. M. , & Aarts, E. H. L. (1987). Simulated annealing. In Simulated annealing: Theory and applications (pp. 7–15). Springer .

Wall, E., & Malm, G. (2016). Using blockchain technology and smart contracts to create a distributed securities depository.

Wong, T., & Luo, Z. (2018). Recurrent auto-encoder model for large-scale industrial sensor signal analysis. In Engineering applications of neural networks (pp. 203–216). Springer International Publishing.

Zhang, W. , Wang, P. , Li, X. , & Shen, D. (2018). The inefficiency of cryptocurrency and its cross-correlation with dow jones industrial average. Physica A: Statistical Mechanics and its Applications, 510 , 658–670 .

