Previous changeset 1:232b874046a7 (2023-08-10) |
Commit message:
add survival and collapsed life table outputs suggested by Wolfgang |
modified:
lifelines_tool/README.md lifelines_tool/lifelineskmcph.xml lifelines_tool/plotlykm.py lifelines_tool/test-data/readme_sample |
added:
lifelines_tool/agepartialrossi.png lifelines_tool/parolepartialrossi.png lifelines_tool/run_log.txt |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/README.md --- a/lifelines_tool/README.md Thu Aug 10 07:15:22 2023 +0000 +++ b/lifelines_tool/README.md Thu Aug 10 22:52:45 2023 +0000 |
[ |
@@ -1,28 +1,40 @@ # lifelines_tool - lifelines statistical package wrapped as a Galaxy tool. -## Galaxy tool to run failure time models using lifelines +A Galaxy tool for right censored failure time data. + +Provides Kaplan-Meier plots with confidence intervals, and optional Cox proportional hazards models -## Install to your Galaxy server from the toolshed - search for lifelines_km_cph_tool owned by fubar2 +Uses the [lifelines](https://lifelines.readthedocs.io/en/latest/index.html) package. + +### Install to your Galaxy server from the toolshed - search for lifelines_km_cph_tool owned by fubar2 -### More at https://lazarus.name/demo/ +More at https://lazarus.name/demo/ + +#### Using the Rossi sample recidivism data from lifelines: -#### Using the Rossi sample input data from lifelines, tool outputs include: +Runs Kaplan-Meier and generates a plot. Optional grouping variable. + +Plots show confidence intervals ![KM plot sample](lifelines_rossi_km.png) -and -![KM plot sample](lifelines_rossi_schoenfeld.png) -and + +If 2 groups, runs a log-rank test for difference. + ![KM plot sample](lifelines_report.png) - -Runs Kaplan-Meier and generates a plot. Optional grouping variable. -If 2 groups, runs a log-rank test for difference. -Plots show confidence intervals +If a comma separated list (for example: prio, age, race, mar, fin) of covariate column names is provided, +a Cox proportional hazards model is run, the assumption of proportionality is tested, and +recommendations made. +![KM plot sample](lifelines_rossi_schoenfeld.png) -If a list of covariate column names is provided, these are used in a -Cox Proportional Hazards model with tests for proportionality. +Also included are partial plots for each covariate like these from the Rossi recidivism lifelines sample data +used in the tool test. -Should work with any tabular data with the required columns - time and status for observations. +![C-PH partial plot samples](agepartialrossi.png) + +![C-PH partial plot samples](parolepartialrossi.png) + +Uses pandas read_csv with tabular delimiters so should work with any tabular data with the required columns - time and status for observations. Issues to https://github.com/fubar2/lifelines_tool please. Autogenerated so pull requests are possibly meaningless but regeneration of a new version should work. |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/agepartialrossi.png |
b |
Binary file lifelines_tool/agepartialrossi.png has changed |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/lifelineskmcph.xml --- a/lifelines_tool/lifelineskmcph.xml Thu Aug 10 07:15:22 2023 +0000 +++ b/lifelines_tool/lifelineskmcph.xml Thu Aug 10 22:52:45 2023 +0000 |
[ |
@@ -1,6 +1,6 @@ <tool name="lifelineskmcph" id="lifelineskmcph" version="0.01"> <!--Source in git at: https://github.com/fubar2/galaxy_tf_overlay--> - <!--Created by toolfactory@galaxy.org at 10/08/2023 15:48:43 using the Galaxy Tool Factory.--> + <!--Created by toolfactory@galaxy.org at 10/08/2023 21:59:53 using the Galaxy Tool Factory.--> <description>Lifelines KM and optional Cox PH models</description> <requirements> <requirement version="1.5.3" type="package">pandas</requirement> @@ -104,11 +104,11 @@ else: colsok = (args.time in defaultcols) and (args.status in defaultcols) if colsok: - sys.stderr.write('replacing first row of data derived header %s with %s' % (testcols, defaultcols)) + print('Replacing first row of data derived header %s with %s' % (testcols, defaultcols)) df.columns = defaultcols else: sys.stderr.write('## CRITICAL USAGE ERROR (not a bug!): time %s and status %s do not match anything in the file header, supplied header or automatic default column names %s' % (args.time, args.status, defaultcols)) -print('## Lifelines tool starting.\nUsing data header =', df.columns, 'time column =', args.time, 'status column =', args.status) +print('## Lifelines tool\nInput data header =', df.columns, 'time column =', args.time, 'status column =', args.status) os.makedirs(args.image_dir, exist_ok=True) fig, ax = plt.subplots() if args.group > '': @@ -136,6 +136,24 @@ ax.set_title(args.title) fig.savefig(os.path.join(args.image_dir,'KM_%s.png' % args.title)) print('#### No grouping variable, so no log rank or other Kaplan-Meier statistical output is available') +survdf = lifelines.utils.survival_table_from_events(df[args.time], df[args.status]) +lifedf = lifelines.utils.survival_table_from_events(df[args.time], df[args.status], collapse=True) +print("#### Survival table using time %s and event %s" % (args.time, args.status)) +with pd.option_context('display.max_rows', None, + 'display.max_columns', None, + 'display.precision', 3, + ): + print(survdf) +print("#### Life table using time %s and event %s" % (args.time, args.status)) +with pd.option_context('display.max_rows', None, + 'display.max_columns', None, + 'display.precision', 3, + ): + print(lifedf) +outpath = os.path.join(args.image_dir,'survival_table.tabular') +survdf.to_csv(outpath, sep='\t') +outpath = os.path.join(args.image_dir,'life_table.tabular') +lifedf.to_csv(outpath, sep='\t') if len(args.cphcols) > 0: fig, ax = plt.subplots() ax.set_title('Cox-PH model: %s' % args.title) @@ -153,7 +171,7 @@ cph.fit(cphdf, duration_col=args.time, event_col=args.status) cph.print_summary() for i, cov in enumerate(colsdf.columns): - if ucolcounts[i] > 10: + if ucolcounts[i] > 10: # a hack - assume categories are sparse - if not imaginary quintiles will have to do v = pd.Series.tolist(cphdf[cov].quantile(QVALS)) vdt = df.dtypes[cov] if vdt == 'int64': |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/parolepartialrossi.png |
b |
Binary file lifelines_tool/parolepartialrossi.png has changed |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/plotlykm.py --- a/lifelines_tool/plotlykm.py Thu Aug 10 07:15:22 2023 +0000 +++ b/lifelines_tool/plotlykm.py Thu Aug 10 22:52:45 2023 +0000 |
[ |
@@ -98,6 +98,24 @@ ax.set_title(args.title) fig.savefig(os.path.join(args.image_dir,'KM_%s.png' % args.title)) print('#### No grouping variable, so no log rank or other Kaplan-Meier statistical output is available') +survdf = lifelines.utils.survival_table_from_events(df[args.time], df[args.status]) +lifedf = lifelines.utils.survival_table_from_events(df[args.time], df[args.status], collapse=True) +print("Survival table using time %s and event %s" % (args.time, args.status)) +with pd.option_context('display.max_rows', None, + 'display.max_columns', None, + 'display.precision', 3, + ): + print(survdf) +print("Life table using time %s and event %s" % (args.time, args.status)) +with pd.option_context('display.max_rows', None, + 'display.max_columns', None, + 'display.precision', 3, + ): + print(lifedf) +outpath = os.path.join(args.image_dir,'survival_table.tabular') +survdf.to_csv(outpath, sep='\t') +outpath = os.path.join(args.image_dir,'life_table.tabular') +lifedf.to_csv(outpath, sep='\t') if len(args.cphcols) > 0: fig, ax = plt.subplots() ax.set_title('Cox-PH model: %s' % args.title) |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/run_log.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/lifelines_tool/run_log.txt Thu Aug 10 22:52:45 2023 +0000 |
[ |
b"@@ -0,0 +1,168 @@\n+## Lifelines tool\n+Input data header = Index(['Unnamed: 0', 'week', 'arrest', 'fin', 'age', 'race', 'wexp', 'mar',\n+ 'paro', 'prio'],\n+ dtype='object') time column = week status column = arrest\n+#### No grouping variable, so no log rank or other Kaplan-Meier statistical output is available\n+Survival table using time week and event arrest\n+ removed observed censored entrance at_risk\n+event_at \n+0.0 0 0 0 432 432\n+1.0 1 1 0 0 432\n+2.0 1 1 0 0 431\n+3.0 1 1 0 0 430\n+4.0 1 1 0 0 429\n+5.0 1 1 0 0 428\n+6.0 1 1 0 0 427\n+7.0 1 1 0 0 426\n+8.0 5 5 0 0 425\n+9.0 2 2 0 0 420\n+10.0 1 1 0 0 418\n+11.0 2 2 0 0 417\n+12.0 2 2 0 0 415\n+13.0 1 1 0 0 413\n+14.0 3 3 0 0 412\n+15.0 2 2 0 0 409\n+16.0 2 2 0 0 407\n+17.0 3 3 0 0 405\n+18.0 3 3 0 0 402\n+19.0 2 2 0 0 399\n+20.0 5 5 0 0 397\n+21.0 2 2 0 0 392\n+22.0 1 1 0 0 390\n+23.0 1 1 0 0 389\n+24.0 4 4 0 0 388\n+25.0 3 3 0 0 384\n+26.0 3 3 0 0 381\n+27.0 2 2 0 0 378\n+28.0 2 2 0 0 376\n+30.0 2 2 0 0 374\n+31.0 1 1 0 0 372\n+32.0 2 2 0 0 371\n+33.0 2 2 0 0 369\n+34.0 2 2 0 0 367\n+35.0 4 4 0 0 365\n+36.0 3 3 0 0 361\n+37.0 4 4 0 0 358\n+38.0 1 1 0 0 354\n+39.0 2 2 0 0 353\n+40.0 4 4 0 0 351\n+42.0 2 2 0 0 347\n+43.0 4 4 0 0 345\n+44.0 2 2 0 0 341\n+45.0 2 2 0 0 339\n+46.0 4 4 0 0 337\n+47.0 1 1 0 0 333\n+48.0 2 2 0 0 332\n+49.0 5 5 0 0 330\n+50.0 3 3 0 0 325\n+52.0 322 4 318 0 322\n+Life table using time week and event arrest\n+ removed observed censored at_risk\n+event_at \n+(-0.001, 13.844] 20 20 0 432\n+(13.844, 27.687] 36 36 0 412\n+(27.687, 41.531] 29 29 0 376\n+(41.531, 55.374] 347 29 318 347\n+### Lifelines test of Proportional Hazards results with prio, age, race, paro, mar, fin as covariates on test\n+<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>\n+ duration col "..b" 1.38 0.31 -0.28 0.92 0.75 2.52\n+paro -0.09 0.91 0.20 -0.47 0.29 0.62 1.34\n+mar -0.48 0.62 0.38 -1.22 0.25 0.30 1.29\n+fin -0.38 0.68 0.19 -0.75 -0.00 0.47 1.00\n+\n+ cmp to z p -log2(p)\n+covariate \n+prio 0.00 3.53 <0.005 11.26\n+age 0.00 -2.95 <0.005 8.28\n+race 0.00 1.04 0.30 1.75\n+paro 0.00 -0.46 0.65 0.63\n+mar 0.00 -1.28 0.20 2.32\n+fin 0.00 -1.98 0.05 4.40\n+---\n+Concordance = 0.63\n+Partial AIC = 1330.00\n+log-likelihood ratio test = 32.77 on 6 df\n+-log2(p) of ll-ratio test = 16.39\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+The ``p_value_threshold`` is set at 0.01. Even under the null hypothesis of no violations, some\n+covariates will be below the threshold by chance. This is compounded when there are many covariates.\n+Similarly, when there are lots of observations, even minor deviances from the proportional hazard\n+assumption will be flagged.\n+\n+With that in mind, it's best to use a combination of statistical tests and visual tests to determine\n+the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``\n+and looking for non-constant lines. See link [A] below for a full example.\n+\n+<lifelines.StatisticalResult: proportional_hazard_test>\n+ null_distribution = chi squared\n+degrees_of_freedom = 1\n+ model = <lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>\n+ test_name = proportional_hazard_test\n+\n+---\n+ test_statistic p -log2(p)\n+age km 6.99 0.01 6.93\n+ rank 7.40 0.01 7.26\n+fin km 0.02 0.90 0.15\n+ rank 0.01 0.91 0.13\n+mar km 1.64 0.20 2.32\n+ rank 1.80 0.18 2.48\n+paro km 0.06 0.81 0.31\n+ rank 0.07 0.79 0.34\n+prio km 0.92 0.34 1.57\n+ rank 0.88 0.35 1.52\n+race km 1.70 0.19 2.38\n+ rank 1.68 0.19 2.36\n+\n+\n+1. Variable 'age' failed the non-proportional test: p-value is 0.0065.\n+\n+ Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be\n+non-linear terms missing. The proportional hazard test used is very sensitive to incorrect\n+functional forms. See documentation in link [D] below on how to specify a functional form.\n+\n+ Advice 2: try binning the variable 'age' using pd.cut, and then specify it in `strata=['age',\n+...]` in the call in `.fit`. See documentation in link [B] below.\n+\n+ Advice 3: try adding an interaction term with your time variable. See documentation in link [C]\n+below.\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+\n+ Bootstrapping lowess lines. May take a moment...\n+\n+\n+---\n+[A] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html\n+[B] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it\n+[C] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates\n+[D] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form\n+[E] https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification\n+\n" |
b |
diff -r 232b874046a7 -r dd5e65893cb8 lifelines_tool/test-data/readme_sample --- a/lifelines_tool/test-data/readme_sample Thu Aug 10 07:15:22 2023 +0000 +++ b/lifelines_tool/test-data/readme_sample Thu Aug 10 22:52:45 2023 +0000 |
[ |
@@ -1,5 +1,5 @@ -## Lifelines tool starting. -Using data header = Index(['Unnamed: 0', 'week', 'arrest', 'fin', 'age', 'race', 'wexp', 'mar', +## Lifelines tool +Input data header = Index(['Unnamed: 0', 'week', 'arrest', 'fin', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'], dtype='object') time column = week status column = arrest Logrank test for race - 0 vs 1 @@ -14,6 +14,66 @@ --- test_statistic p -log2(p) 0.58 0.45 1.16 +#### Survival table using time week and event arrest + removed observed censored entrance at_risk +event_at +0.0 0 0 0 432 432 +1.0 1 1 0 0 432 +2.0 1 1 0 0 431 +3.0 1 1 0 0 430 +4.0 1 1 0 0 429 +5.0 1 1 0 0 428 +6.0 1 1 0 0 427 +7.0 1 1 0 0 426 +8.0 5 5 0 0 425 +9.0 2 2 0 0 420 +10.0 1 1 0 0 418 +11.0 2 2 0 0 417 +12.0 2 2 0 0 415 +13.0 1 1 0 0 413 +14.0 3 3 0 0 412 +15.0 2 2 0 0 409 +16.0 2 2 0 0 407 +17.0 3 3 0 0 405 +18.0 3 3 0 0 402 +19.0 2 2 0 0 399 +20.0 5 5 0 0 397 +21.0 2 2 0 0 392 +22.0 1 1 0 0 390 +23.0 1 1 0 0 389 +24.0 4 4 0 0 388 +25.0 3 3 0 0 384 +26.0 3 3 0 0 381 +27.0 2 2 0 0 378 +28.0 2 2 0 0 376 +30.0 2 2 0 0 374 +31.0 1 1 0 0 372 +32.0 2 2 0 0 371 +33.0 2 2 0 0 369 +34.0 2 2 0 0 367 +35.0 4 4 0 0 365 +36.0 3 3 0 0 361 +37.0 4 4 0 0 358 +38.0 1 1 0 0 354 +39.0 2 2 0 0 353 +40.0 4 4 0 0 351 +42.0 2 2 0 0 347 +43.0 4 4 0 0 345 +44.0 2 2 0 0 341 +45.0 2 2 0 0 339 +46.0 4 4 0 0 337 +47.0 1 1 0 0 333 +48.0 2 2 0 0 332 +49.0 5 5 0 0 330 +50.0 3 3 0 0 325 +52.0 322 4 318 0 322 +#### Life table using time week and event arrest + removed observed censored at_risk +event_at +(-0.001, 13.844] 20 20 0 432 +(13.844, 27.687] 36 36 0 412 +(27.687, 41.531] 29 29 0 376 +(41.531, 55.374] 347 29 318 347 ### Lifelines test of Proportional Hazards results with prio, age, race, paro, mar, fin as covariates on KM and CPH in lifelines test <lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations> duration col = 'week' @@ -22,7 +82,7 @@ number of observations = 432 number of events observed = 114 partial log-likelihood = -659.00 - time fit was run = 2023-08-10 05:49:04 UTC + time fit was run = 2023-08-10 12:00:14 UTC --- coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% |