Fitting
-------

Requires:

- r-4.1.3-1

Conventions and best approaches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- S0 Signal for unbound state

- S1 Signal for bound state

- K equilibrium constant (Kd or pKa)

- order data from unbound to bound (e.g. cl: 0–>150 mM; pH 9–>5)

- lmfit.Model has very convenient results plot functionalities and the unique possibility to estimate upper lower fitting curves.

- R nls and nlsboot seems very convenient (Q-Q plt) and fast; nls can perform global fit but nlstools can not.

Initial imports
~~~~~~~~~~~~~~~

.. code:: python

    import numpy as np
    import scipy
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sb
    import rpy2
    from rpy2.robjects import r
    from rpy2.robjects.packages import importr
    from rpy2.robjects import globalenv
    from rpy2.robjects import pandas2ri
    import lmfit

    pandas2ri.activate()
    %load_ext rpy2.ipython

    MASS = importr('MASS')
    #r('library(MASS)')

::

    The rpy2.ipython extension is already loaded. To reload it, use:
      %reload_ext rpy2.ipython

Single Cl titration.
~~~~~~~~~~~~~~~~~~~~

.. code:: python

    df = pd.read_table("../../tests/data/copyIP.txt")
    sb.scatterplot(data=df, x="cl", y="F", hue=df.cl*df.F, palette="crest", s=200, alpha=.8, legend=False)

.. image:: ../_static/f01.png

Using R
^^^^^^^

.. code:: R

    d <- read.delim("../../tests/data/copyIP.txt")
    fit = nls(F ~ (S0 + S1 * cl / Kd)/ (1 + cl / Kd), start = list(S0=7e7, S1=0, Kd=12), data=d)
    summary(fit)

::


    Formula: F ~ (S0 + S1 * cl/Kd)/(1 + cl/Kd)

    Parameters:
        Estimate Std. Error t value Pr(>|t|)
    S0 1.341e+07  8.713e+04 153.894 5.08e-12 ***
    S1 5.635e+05  1.064e+05   5.296  0.00184 **
    Kd 5.832e+01  2.247e+00  25.957 2.16e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 118200 on 6 degrees of freedom

    Number of iterations to convergence: 6
    Achieved convergence tolerance: 1.608e-06

.. code:: R

    confint(fit)

::

    Waiting for profiling to be done...
               2.5%        97.5%
    S0 1.319762e+07 1.362273e+07
    S1 3.009078e+05 8.199737e+05
    Kd 5.313128e+01 6.407096e+01


.. code:: R

    library(nlstools)
    set.seed(4)
    nb = nlsBoot(fit, niter=999)
    summary(nb)

::


    ------
    Bootstrap statistics
           Estimate   Std. error
    S0 1.341194e+07 70877.210905
    S1 5.662159e+05 88287.461199
    Kd 5.827538e+01     1.834132

    ------
    Median of bootstrap estimates and percentile confidence intervals
             Median         2.5%        97.5%
    S0 1.341055e+07 1.328220e+07 1.354591e+07
    S1 5.662406e+05 4.108650e+05 7.369380e+05
    Kd 5.826980e+01 5.477262e+01 6.188986e+01

.. code:: R

    plot(nb)

.. image:: ../_static/r_bs.png

using rpy2
^^^^^^^^^^

.. code:: python

    globalenv['Rdf'] = df

    fit = rpy2.robjects.r('nls(F ~ (S0 + S1 * cl / Kd)/ (1 + cl / Kd), start = list(S0=7e7, S1=0, Kd=12), data=Rdf) ')
    globalenv['rfit'] = fit
    print(r.summary(fit))
    print(r.confint(fit))
    %R print("")
    %R print(confint(rfit))

::

    R[write to console]: Waiting for profiling to be done...

    Formula: F ~ (S0 + S1 * cl/Kd)/(1 + cl/Kd)

    Parameters:
        Estimate Std. Error t value Pr(>|t|)
    S0 1.341e+07  8.713e+04 153.894 5.08e-12 ***
    S1 5.635e+05  1.064e+05   5.296  0.00184 **
    Kd 5.832e+01  2.247e+00  25.957 2.16e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 118200 on 6 degrees of freedom

    Number of iterations to convergence: 6
    Achieved convergence tolerance: 1.608e-06


    [[1.31976181e+07 1.36227292e+07]
     [3.00907829e+05 8.19973684e+05]
     [5.31312807e+01 6.40709639e+01]]
    [1] ""
    R[write to console]: Waiting for profiling to be done...
               2.5%        97.5%
    S0 1.319762e+07 1.362273e+07
    S1 3.009078e+05 8.199737e+05
    Kd 5.313128e+01 6.407096e+01

::

    array([[1.31976181e+07, 1.36227292e+07],
           [3.00907829e+05, 8.19973684e+05],
           [5.31312807e+01, 6.40709639e+01]])

With older versions Rpy2 output looked nicer

.. code:: python

    print(MASS.confint_nls(fit, 'Kd'))
    print(rpy2.robjects.r('summary(rfit)'))

::

    R[write to console]: Waiting for profiling to be done...
    [53.13128065 64.07096394]

    Formula: F ~ (S0 + S1 * cl/Kd)/(1 + cl/Kd)

    Parameters:
        Estimate Std. Error t value Pr(>|t|)
    S0 1.341e+07  8.713e+04 153.894 5.08e-12 ***
    S1 5.635e+05  1.064e+05   5.296  0.00184 **
    Kd 5.832e+01  2.247e+00  25.957 2.16e-07 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 118200 on 6 degrees of freedom

    Number of iterations to convergence: 6
    Achieved convergence tolerance: 1.608e-06

.. code:: python

    nlstools = importr('nlstools')
    base = importr('base')
    base.set_seed(4)
    nb = nlstools.nlsBoot(fit, niter=999)
    globalenv['nb'] = nb
    globalenv['fit'] = fit

.. code:: python

    %%R
    plot(nb)
    summary(nb)

::


    ------
    Bootstrap statistics
           Estimate   Std. error
    S0 1.341194e+07 70877.210905
    S1 5.662159e+05 88287.461199
    Kd 5.827538e+01     1.834132

    ------
    Median of bootstrap estimates and percentile confidence intervals
             Median         2.5%        97.5%
    S0 1.341055e+07 1.328220e+07 1.354591e+07
    S1 5.662406e+05 4.108650e+05 7.369380e+05
    Kd 5.826980e+01 5.477262e+01 6.188986e+01

.. image:: ../_static/rpy_bs.png

lmfit
^^^^^

.. code:: python

    import lmfit

    def residual(pars, x, y=None):
        S0 =  pars['S0']
        S1 =  pars['S1']
        Kd = pars['Kd']
        model = (S0 + S1 * x / Kd) / (1 + x / Kd)
        if y is None:
            return model
        return (y - model)

    params = lmfit.Parameters()
    params.add('S0', value=df.F[0])
    params.add('S1', value=100)
    params.add('Kd', value=50, vary=True)

    out = lmfit.minimize(residual, params, args=(df.cl, df.F,))

    xdelta = (df.cl.max() - df.cl.min()) / 500
    xfit = np.arange(df.cl.min() - xdelta, df.cl.max() + xdelta, xdelta)
    yfit = residual(out.params, xfit)
    print(lmfit.fit_report(out.params))
    plt.plot(df.cl, df.F, "o", xfit, yfit, "-")

::

    [[Variables]]
        S0:  13408867.7 +/- 87130.4207 (0.65%) (init = 1.33579e+07)
        S1:  563536.896 +/- 106411.773 (18.88%) (init = 100)
        Kd:  58.3187813 +/- 2.24670302 (3.85%) (init = 50)
    [[Correlations]] (unreported correlations are < 0.100)
        C(S1, Kd) = -0.712
        C(S0, Kd) = -0.656
        C(S0, S1) = 0.275

.. image:: ../_static/lmfit1.png

.. code:: python

    import lmfit
    def residuals(p):
        S0 =  p['S0']
        S1 =  p['S1']
        Kd = p['Kd']
        model = (S0 + S1 * df.cl / Kd) / (1 + df.cl / Kd)
        return (model - df.F)

    mini = lmfit.Minimizer(residuals, params)
    res = mini.minimize()
    ci, tr = lmfit.conf_interval(mini, res, sigmas=[.68, .95], trace=True)
    print(lmfit.ci_report(ci, with_offset=False, ndigits=2))
    print(lmfit.fit_report(res, show_correl=False, sort_pars=True))

::

        95.00% 68.00% _BEST_ 68.00% 95.00%
     S0:13197616.3413314946.3513408867.6813503300.3813622729.13
     S1:300911.47447991.63563536.93677869.66819977.61
     Kd:  53.13  55.96  58.32  60.79  64.07
    [[Fit Statistics]]
        # fitting method   = leastsq
        # function evals   = 17
        # data points      = 9
        # variables        = 3
        chi-square         = 8.3839e+10
        reduced chi-square = 1.3973e+10
        Akaike info crit   = 212.594471
        Bayesian info crit = 213.186145
    [[Variables]]
        Kd:  58.3187808 +/- 2.24670301 (3.85%) (init = 50)
        S0:  13408867.7 +/- 87130.4216 (0.65%) (init = 1.33579e+07)
        S1:  563536.932 +/- 106411.771 (18.88%) (init = 100)

.. code:: python

    names = res.params.keys()
    i = 0
    gs = plt.GridSpec(4, 4)
    sx = {}
    sy = {}
    for fixed in names:
        j = 0
        for free in names:
            if j in sx and i in sy:
                ax = plt.subplot(gs[i, j], sharex=sx[j], sharey=sy[i])
            elif i in sy:
                ax = plt.subplot(gs[i, j], sharey=sy[i])
                sx[j] = ax
            elif j in sx:
                ax = plt.subplot(gs[i, j], sharex=sx[j])
                sy[i] = ax
            else:
                ax = plt.subplot(gs[i, j])
                sy[i] = ax
                sx[j] = ax
            if i < 3:
                plt.setp(ax.get_xticklabels(), visible=True)
            else:
                ax.set_xlabel(free)

            if j > 0:
                plt.setp(ax.get_yticklabels(), visible=False)
            else:
                ax.set_ylabel(fixed)

            rest = tr[fixed]
            prob = rest['prob']
            f = prob < 0.96

            x, y = rest[free], rest[fixed]
            ax.scatter(x[f], y[f], c=1-prob[f], s=25*(1-prob[f]+0.5))
            ax.autoscale(1, 1)
            j += 1
        i += 1

.. image:: ../_static/lmfit2.png

.. code:: python

    names = list(res.params.keys())

    plt.figure()
    for i in range(3):
        for j in range(3):
            indx = 9-j*3-i
            ax = plt.subplot(3, 3, indx)
            ax.ticklabel_format(style='sci', scilimits=(-2, 2), axis='y')

            # set-up labels and tick marks
            ax.tick_params(labelleft=False, labelbottom=False)
            if indx in (1, 4, 7):
                plt.ylabel(names[j])
                ax.tick_params(labelleft=True)
            if indx == 1:
                ax.tick_params(labelleft=True)
            if indx in (7, 8, 9):
                plt.xlabel(names[i])
                ax.tick_params(labelbottom=True)
                [label.set_rotation(45) for label in ax.get_xticklabels()]

            if i != j:
                x, y, m = lmfit.conf_interval2d(mini, res, names[i], names[j], 20, 20)
                plt.contourf(x, y, m, np.linspace(0, 1, 10))

                x = tr[names[i]][names[i]]
                y = tr[names[i]][names[j]]
                pr = tr[names[i]]['prob']
                s = np.argsort(x)
                plt.scatter(x[s], y[s], c=pr[s], s=30, lw=1)

            else:
                x = tr[names[i]][names[i]]
                y = tr[names[i]]['prob']

                t, s = np.unique(x, True)
                f = scipy.interpolate.interp1d(t, y[s], 'slinear')
                xn = np.linspace(x.min(), x.max(), 50)
                plt.plot(xn, f(xn), lw=1)
                plt.ylabel('prob')
                ax.tick_params(labelleft=True)

    plt.tight_layout()

.. image:: ../_static/lmfit3.png

.. code:: python

    lmfit.report_fit(out.params, min_correl=0.25)

    ci, trace = lmfit.conf_interval(mini, res, sigmas=[1, 2], trace=True)
    lmfit.printfuncs.report_ci(ci)

    fig, axes = plt.subplots(2, 2, figsize=(12.8, 9.6), sharey=True)
    cx1, cy1, prob = trace['S0']['S0'], trace['S0']['Kd'], trace['S0']['prob']
    cx2, cy2, prob2 = trace['S1']['S1'], trace['S1']['Kd'], trace['S1']['prob']

    axes[0][0].scatter(cx1, cy1, c=prob, s=30)
    axes[0][0].set_xlabel('S0')
    axes[0][0].set_ylabel('Kd')

    axes[0][1].scatter(cx2, cy2, c=prob2, s=30)
    axes[0][1].set_xlabel('S1')

    cx, cy, grid = lmfit.conf_interval2d(mini, res, 'S0', 'Kd', 30, 30)
    ctp = axes[1][0].contourf(cx, cy, grid, np.linspace(0, 1, 11))
    fig.colorbar(ctp, ax=axes[1][0])
    axes[1][0].set_xlabel('S0')
    axes[1][0].set_ylabel('Kd')

    cx, cy, grid = lmfit.conf_interval2d(mini, res, 'S1', 'Kd', 30, 30)
    ctp = axes[1][1].contourf(cx, cy, grid, np.linspace(0, 1, 11))
    fig.colorbar(ctp, ax=axes[1][1])
    axes[1][1].set_xlabel('S1')
    axes[1][1].set_ylabel('Kd')

.. code:: python

    x, y, grid = lmfit.conf_interval2d(mini, res, 'S0','S1', 30, 30)
    plt.contourf(x, y, grid, np.linspace(0,1,11))
    plt.xlabel('S0')
    plt.colorbar()
    plt.ylabel('S1')

.. image:: ../_static/lmfit5.png

::

    [[Variables]]
        S0:  13408867.7 +/- 87130.4207 (0.65%) (init = 1.33579e+07)
        S1:  563536.896 +/- 106411.773 (18.88%) (init = 100)
        Kd:  58.3187813 +/- 2.24670302 (3.85%) (init = 50)
    [[Correlations]] (unreported correlations are < 0.250)
        C(S1, Kd) = -0.712
        C(S0, Kd) = -0.656
        C(S0, S1) = 0.275
           95.45%    68.27%    _BEST_    68.27%    95.45%
     S0:-217226.57750-94491.3408813408867.68157+95008.90246+219989.07757
     S1:-270193.18488-116251.92509563536.93239+115024.55042+263650.95181
     Kd:  -5.32812  -2.37783  58.31878  +2.48963  +5.92424

.. image:: ../_static/lmfit4.png

Notes
~~~~~

You could implement global fitting using scipy.leastq but will sometime fail in bootstrapping.
lmfit resulted much more robust

.. code:: python

    def fit_pH_global(fz, x, dy1, dy2):
        """Fit 2 dataset (x, y1, y2) with a single protonation site model
        """
        y1 = np.array(dy1)
        y2 = np.array(dy2)

        def ssq(p, x, y1, y2):
            return np.r_[y1 - fz(p[0], p[1:3], x), y2 - fz(p[0], p[3:5], x)]
        p0 = np.r_[x[2], y1[0], y1[-1], y2[0], y2[-1]]
        p, cov, info, msg, success = optimize.leastsq(ssq, p0, args=(x, y1, y2),
                                                      full_output=True, xtol=1e-11)
        res = namedtuple("Result", "success msg df chisqr K sK SA_1 sSA_1 \
                         SB_1 sSB_1 SA_2 sSA_2 SB_2 sSB_2")
        res.msg = msg
        res.success = success
        if 1 <= success <= 4:
            chisq = sum(info['fvec'] * info['fvec'])
            res.df = len(y1) + len(y2) - len(p)
            res.chisqr = chisq / res.df
            res.K = p[0]
            #res.sK = np.sqrt(cov[0][0] * res.chisqr)
            res.SA_1 = p[1]
            #res.sSA_1 = np.sqrt(cov[1][1] * res.chisqr)
            res.SB_1 = p[2]
            #res.sSB_1 = np.sqrt(cov[2][2] * res.chisqr)
            res.SA_2 = p[3]
            #res.sSA_2 = np.sqrt(cov[3][3] * res.chisqr)
            res.SB_2 = p[4]
            #res.sSB_2 = np.sqrt(cov[4][4] * res.chisqr)
        return res

    result = fit_pH_global(fz, df.x, df.y1, df.y2)