Trading algorithm - actions in Q-learning / DQN

I have implemented the following in MATLAB.

I am trying to build a trading algorithm using deep Q-learning. I only have daily stock prices, and I am using these as the training set.

My state is: the amount of cash I have (money), the number of shares I own (stock), and the price of a share at that time (price), i.e. [money, stock, price].

The problem I am facing is the actions: looking online, people seem to use only three actions, { buy | sell | hold }.

My reward function is the difference between the portfolio value at the current time step and the previous time step.
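
In other words, a minimal sketch of that reward (the variable names here are only illustrative):

% portfolio value before and after the step, valued at the prices of those times
portfolio_prev = money_prev + stock_prev * price_prev;
portfolio_curr = money_curr + stock_curr * price_curr;
reward         = portfolio_curr - portfolio_prev;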

But with only three actions, I do not see how I could choose to buy, say, 67 shares at a given price.

I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell 0-100 shares, hold, or buy 1-100 shares.

Can anyone shed some light on how I can reduce this to 3 actions?
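
For illustration, here is one hedged sketch (the function name tradeSize and the 0.5 sizing fraction are assumptions, not something from the code below) of how the action set could stay at { buy | hold | sell } while the share quantity comes from a fixed sizing rule instead of the network:

% illustrative only: the action picks the direction, a fixed rule picks the size
function qty = tradeSize( a, money, stock, price )   % a: 1 = buy, 2 = hold, 3 = sell
    fraction = 0.5;                                   % assumed sizing rule: trade half
    switch a
        case 1
            qty =  floor( fraction * money / price ); % buy what half the cash affords
        case 2
            qty =  0;                                 % hold
        case 3
            qty = -floor( fraction * stock );         % sell half of the current holding
    end
end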

My code:

%  p  is the column vector of daily stock prices
%  sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

hidden_layers =   1;     % size of the hidden layer (newff takes layer sizes)
actions       = 202;     % number of discrete trade-size actions
net           = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
                       [hidden_layers, actions],                     ...
                       {'tansig','purelin'},                         ...
                       'trainlm' );

net           = init( net );

net.trainParam.showWindow = false;

% neural network training parameters -----------------------------------
net.trainParam.lr     =   0.01;
net.trainParam.mc     =   0.1;
net.trainParam.epochs = 100;

% parameters for q learning --------------------------------------------
epsilon        =    0.8;
gamma          =    0.95;
max_episodes   = 1000;
max_iterations = length( p ) - 1;

reset          =    false;
initial_money  = 1000;
initial_stock  =    0;

% These arrays store the outputs of each step
save_s        = zeros( max_iterations, max_episodes );
save_pt       = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a        = zeros( max_iterations, max_episodes );

% construct the inital state -------------------------------------------
% a           = randi( [1 3], 1, 1 );  
s             = [initial_money; initial_stock; p( 1, 1 )];


% construct initial q matrix -------------------------------------------
Qs            = zeros( 1, actions );
Qs_prime      = zeros( 1, actions );


for     i = 1:max_episodes
    for j = 1:max_iterations             % max_iterations --------------

        Qs = net( s );                       % current Q-value estimates for state s

        % choose an action with an epsilon-greedy strategy:
        % explore with probability epsilon, otherwise act greedily
        if ( rand() > epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 actions], 1, 1 );
        end

        a2                 = a - 101;        % map the action index to a signed number of shares to trade
        save_a(j,i)        = a2;
        sp                 = p( j+1, 1 );    % next day's price
        pt                 = s( 1 ) + s( 2 ) * p( j, 1 );   % current portfolio value
        save_pt(j,i)       = pt;
        [s_prime,reward]   = simulateStock( s, a2, pt, sp );

        Qs_prime           = net( s_prime );

        Q_target           = reward + gamma * max( Qs_prime );   % TD target
        save_Q_target(j,i) = Q_target;
        Targets            = Qs;
        Targets( a )       = Q_target;

        % fit the network toward the TD target for the chosen action,
        % leaving the other outputs at their current estimates
        net                = train( net, s, Targets );

        save_s( j, i )     = s( 1 );
        s                  = s_prime;
    end

    epsilon = epsilon * 0.99 ; 
    reset   = false; 
    s       = [initial_money; initial_stock; p(1,1)];
end

% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money = s(1);
    stock = s(2);
    price = s(3);

    % clamp the requested trade: never buy more than the cash affords,
    % never sell more shares than are currently held
    if a > 0
        a = min( a, floor( money / price ) );
    else
        a = max( a, -stock );
    end

    money   = money - a * price;
    stock   = stock + a;

    s_prime = [money; stock; sp];
    reward  = ( money + stock * sp ) - pt;   % change in portfolio value, valued at the next day's price
end

      





1 answer


Prologue: This answer reflects many decades (actually more than I would like to admit) of hands-on practice with quantitative methods, so forgive me for drawing on that pool of experience about what works and what does not (and, with due respect, why it cannot work), however massively and irresponsibly the modern, mostly populist, media broadcast otherwise.


Actions: undefined
(unless a definitive reason is given for such a flattened, decapitated and notoriously short model)

You might be right that using only a { buy | hold | sell } set of actions is a frequent habit of academic work, where authors sometimes choose to illustrate their academic efforts at improving learning / statistical methods and opt for a sample application in the trading domain. The pity is, this may pass in scientific articles, but not in the reality of trading.

Why?

Even with a basic view of trading, the problem is much more complex. As a quick reference, there are at least five principal domains of the model space. Given that trading has to be simulated, one cannot go without a fully described strategy:

Tru-Strategy := {    SelectPOLICY,
                     DetectPOLICY,
                        ActPOLICY,
                   AllocatePOLICY,
                  TerminatePOLICY
                  }

      

Any motivated simplification that omits any one of these five principal domains is anything but a true trading strategy.
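
As a purely illustrative sketch (the field names and trivial handles below are placeholders, not a given specification), the five domains can at least be made explicit, so that an omitted one is immediately visible:

% illustrative only: the five policy domains as a struct of placeholder handles
strategy = struct( ...
    'SelectPOLICY',    @( universe )        universe, ... % which instruments to trade
    'DetectPOLICY',    @( state )           false,    ... % when an opportunity exists
    'ActPOLICY',       @( state )           0,        ... % what order to send
    'AllocatePOLICY',  @( state, signal )   0,        ... % how much capital to commit
    'TerminatePOLICY', @( state, position ) false     ... % when / how to exit
    );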

It is easy to understand what happens when one simply trains (and, worse, later uses in real trades) a poorly defined model that is inconsistent with reality.

Of course, such training can (and, provided a minimisation criterion is formulated, will) drive some mathematical function to a minimum, but that does not guarantee that reality will immediately change its natural behaviour and start to "obey" the poorly defined model and "dance" to such a simplified or otherwise distorted (poorly modelled) view of reality.


Rewards: vague
(no reason is given for ignoring the fact of deferred rewards)

If in doubt about what this means, try to follow an example: today, the model-strategy decides A:Buy(AAPL, 67). Tomorrow AAPL falls by about 0.1%, so the immediate reward (as proposed above) is negative, thus punishing that decision. The model is encouraged not to do this (not to buy AAPL).

The fact is that after a certain period of time AAPL rises much higher, which would earn a much larger reward than the initial day-to-day Close fluctuations, something that is well known but that the proposed Q-fun model simply, and fundamentally wrongly, does not reflect at all.
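
One common mitigation, sketched here only as an assumption and not as a repair of the code above, is to score a decision by a discounted multi-day return instead of the single next-day change:

% illustrative n-step discounted return for a decision taken at day t;
% p is the daily price series, n and gamma are assumed parameters
function G = nStepReturn( p, t, n, gamma )
    G = 0;
    for k = 1:min( n, length( p ) - t )
        % discounted day-over-day price change following the decision
        G = G + gamma^( k-1 ) * ( p( t+k ) - p( t+k-1 ) );
    end
end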

Beware of WYTIWYG - What you train is what you get ...

This means that the model "as is" can be trained to act according to such specific incentives, but its actual behaviour will amount to NOTHING but extremely naive intraday "quasi-scalping" snapshots, with limited (if any) support from the actual market state and market dynamics of the kind available in many generally accepted quantitative models.

So, sure, one can train a model that stays blind and deaf (ignoring the reality of the problem domain), but for what?


Epilogue:

There is nothing like "data science",
however loudly MarCom and HR beat their drums and whistles (as they really do a lot these days).

Why?

Because of the rationale above. Data points by themselves are nothing. Sure, having them is better than standing in front of a client without a single observation of reality, but data points alone do not save the game.

It is the domain knowledge that starts to make sense of the data points, not the data points per se.

If you are still in doubt: having multiple terabytes of numbers does not by itself tell you what the data represent.

On the other hand, even if it is known from the domain-specific context that these data points ought to be temperature readings, there is still no Data-Science God to tell you whether they are (coincidentally) all in [°K] or in [°C] (if there are only positive readings >= 0.00001).









