How to randomly sample lognormal data in Python using inverse CDF and specify target percentiles?

I am trying to generate random samples from a lognormal distribution in Python, an application designed to simulate network traffic. I would like to generate samples in such a way that:

  • Modal fetch result is 320 (~ 10 ^ 2.5)
  • 80% of samples are in the range 100 to 1000 (10 ^ 2 to 10 ^ 3)

My strategy is to use inverse CDF (or Smirnov transform, which I believe):

  • Use PDF for normal distribution centered around 2.5 to calculate PDF for 10 ^ x, where x ~ N (2.5, sigma).
  • Calculate the CDF for the above distribution.
  • Generate random homogeneous data over the interval from 0 to 1.
  • Use inverse CDF to convert random, homogeneous data to the desired range.

The problem is that when I calculate the 10th and 90th percentiles at the end, I have completely wrong numbers.

Here is my code:

%matplotlib inline

import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm

# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)

# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')

x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)

# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)

# Generate random uniform data
input = np.random.uniform(size=10000)

# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]

# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]

# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)

      

Which produces the output:

(223.99999999999997, 2480.0000000000009)

      

... not the (100, 1000) I would like to see. Any advice is appreciated!

+3


source to share


2 answers


First, I'm not sure about Use the PDF for a normal distribution centred around 2.5

. After all, log-normal has a base logarithm e

(aka natural log), which means 320 = 10 2.5 = e 5.77 .

Secondly, I would approach the problem differently. You need m

and s

sample from Log-Normal .

If you look at the wiki article above, you can see that this is a two-parameter distribution. And you have exactly two conditions:



Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8

      

where CDF is expressed in terms of an error function (which is a fairly common function found in any library)

So, two nonlinear equations for two parameters. Solve them, find m

and s

and place it in any standard log-normal sample

+2


source


Severin's approach is much more compact than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although its rather trivial to do it manually):

# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation

%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

from scipy.optimize import fsolve
from scipy.stats import lognorm
import math

target_modal_value = 320

# Define function to find roots of
def equation(s):

    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2

    # Get probability mass from CDF at 100 and 1000, should equal to 0.8.
    # Rearange equation so that =0, to find root (value of s)
    return (lognorm.cdf(1000,s=s, scale=math.exp(m)) - lognorm.cdf(100,s=s, scale=math.exp(m)) -0.8)

# Solve non-linear equation to find s
s_initial_guess = 1
s =  fsolve(equation, s_initial_guess)

# From s, find m
m = math.log(target_modal_value) + s**2
print('m='+str(m)+', s='+str(s)) #(m,s))

# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')

      



Logarithmic distribution with mode at 320

+2


source







All Articles