How to use indigopy

Example code showing how to use the indigopy package. The sample data used in this notebook is derived from the INDIGO, INDIGO-MTB, and MAGENTA publications.

Set up environment

[1]:
# Import dependencies
import pandas as pd
from scipy.stats import spearmanr
from sklearn.metrics import r2_score, classification_report
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import seaborn as sns
import matplotlib.pyplot as plt

# Import package functions
import sys
sys.path.append('c:/Users/carol/github/INDIGOpy/') # modify if testing locally on a different machine; remove once the package is published
from indigopy.core import load_sample, featurize, classify

Example: E. coli

The following analysis and results were originally reported in the INDIGO paper.
- Training dataset: 105 two-way interactions between 15 antibiotics
- Testing dataset: 66 two-way interactions between the 15 antibiotics in the training set plus 4 new antibiotics
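For reference, a hypothetical sketch of the input format consumed by the cells below: interactions as tuples of drug names, with one measured interaction score apiece. The drug names and values here are illustrative, not the actual sample data.

```python
# Hypothetical illustration of the input format (not the actual sample data):
# interactions are tuples of drug names, with one interaction score apiece.
example_ixns   = [('AMP', 'TET'), ('AMP', 'CIP'), ('TET', 'CIP')]
example_scores = [-1.2, 0.4, 2.5]

# Each interaction must have a matching score.
assert len(example_ixns) == len(example_scores)
```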
[2]:
# Load sample data
sample = load_sample('ecoli')

# Define input arguments
key             = sample['key']
profiles        = sample['profiles']
feature_names   = sample['feature_names']
train_ixns      = sample['train']['interactions']
train_scores    = sample['train']['scores']
test_ixns       = sample['test']['interactions']
test_scores     = sample['test']['scores']

# Determine ML features
train_data      = featurize(train_ixns, profiles, feature_names=feature_names, key=key, silent=True)
test_data       = featurize(test_ixns, profiles, feature_names=feature_names, key=key, silent=True)
X_train, X_test = train_data['feature_df'].to_numpy().transpose(), test_data['feature_df'].to_numpy().transpose()

# Determine class labels
thresh, classes = (-0.5, 2), ('S', 'N', 'A')
train_labels    = classify(train_scores, thresholds=thresh, classes=classes)
test_labels     = classify(test_scores, thresholds=thresh, classes=classes)

# Train and apply a regression-based model
reg_model = RandomForestRegressor()
reg_model.fit(X_train, train_scores)
reg_y = reg_model.predict(X_test)
r, p = spearmanr(test_scores, reg_y)
r2 = r2_score(test_scores, reg_y)
print('Regression results:')
print('\tSpearman R = {}'.format(round(r, 4)))
print('\tSpearman p = {:.3g}'.format(p))
print('\tR2 = {}'.format(round(r2, 4)))

# Train and apply a classification-based model
class_model = RandomForestClassifier()
class_model.fit(X_train, train_labels)
class_y = class_model.predict(X_test)
print('Classification results:')
print(classification_report(test_labels, class_y))
Defining INDIGO features: 100%|██████████| 105/105 [00:00<00:00, 406.95it/s]
Defining INDIGO features: 100%|██████████| 66/66 [00:00<00:00, 464.53it/s]
Regression results:
        Spearman R = 0.6575
        Spearman p = 2e-09
        R2 = 0.3895
Classification results:
              precision    recall  f1-score   support

           A       0.50      0.31      0.38        13
           N       0.68      0.90      0.78        42
           S       0.50      0.09      0.15        11

    accuracy                           0.65        66
   macro avg       0.56      0.43      0.44        66
weighted avg       0.61      0.65      0.59        66
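The thresholds passed to classify above bin each score into one of the three classes. A minimal stand-in for that rule, assuming scores at or below the lower threshold are synergistic ('S') and at or above the upper threshold are antagonistic ('A'); the exact boundary handling in indigopy.core.classify may differ:

```python
def classify_sketch(scores, thresholds=(-0.5, 2), classes=('S', 'N', 'A')):
    """Minimal stand-in for indigopy.core.classify: bin scores at two cut-offs."""
    lower, upper = thresholds
    syn, neu, ant = classes
    return [syn if s <= lower else ant if s >= upper else neu for s in scores]

print(classify_sketch([-1.2, 0.4, 2.5]))  # ['S', 'N', 'A']
```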

Example: M. tuberculosis

The following analysis and results were originally reported in the INDIGO-MTB paper.
- Training dataset: 196 two- to five-way interactions between 40 antibacterials
- Testing dataset: 36 two- to three-way interactions between 13 antibacterials
- Clinical dataset: clinical outcomes for 57 two- to five-way interactions between 7 antibacterials
[3]:
# Load sample data
sample = load_sample('mtb')

# Define input arguments
key             = sample['key']
profiles        = sample['profiles']
feature_names   = sample['feature_names']
train_ixns      = sample['train']['interactions']
train_scores    = sample['train']['scores']
test_ixns       = sample['test']['interactions']
test_scores     = sample['test']['scores']
clinical_ixns   = sample['clinical']['interactions']
clinical_scores = sample['clinical']['scores']

# Determine ML features
train_data      = featurize(train_ixns, profiles, feature_names=feature_names, key=key, silent=True)
test_data       = featurize(test_ixns, profiles, feature_names=feature_names, key=key, silent=True)
clinical_data   = featurize(clinical_ixns, profiles, feature_names=feature_names, key=key, silent=True)
X_train, X_test = train_data['feature_df'].to_numpy().transpose(), test_data['feature_df'].to_numpy().transpose()
X_clinical      = clinical_data['feature_df'].to_numpy().transpose()

# Determine class labels
thresh, classes = (0.9, 1.1), ('S', 'N', 'A')
train_labels    = classify(train_scores, thresholds=thresh, classes=classes)
test_labels     = classify(test_scores, thresholds=thresh, classes=classes)

# Train and apply a regression-based model
reg_model = RandomForestRegressor()
reg_model.fit(X_train, train_scores)
reg_y = reg_model.predict(X_test)
r, p = spearmanr(test_scores, reg_y)
r2 = r2_score(test_scores, reg_y)
print('Regression results:')
print('\tSpearman R = {}'.format(round(r, 4)))
print('\tSpearman p = {:.3g}'.format(p))
print('\tR2 = {}'.format(round(r2, 4)))

# Train and apply a classification-based model
class_model = RandomForestClassifier()
class_model.fit(X_train, train_labels)
class_y = class_model.predict(X_test)
print('Classification results:')
print(classification_report(test_labels, class_y))

# Apply model to clinical data
clinical_y = reg_model.predict(X_clinical)
r, p = spearmanr(clinical_scores, clinical_y)
print('Clinical results:')
# note: the sign of r is flipped here, presumably because the clinical outcome
# scale is inverted relative to the interaction-score scale
print('\tSpearman R = {}'.format(round(-r, 4)))
print('\tSpearman p = {:.3g}'.format(p))
Defining INDIGO features: 100%|██████████| 196/196 [00:00<00:00, 512.41it/s]
Defining INDIGO features: 100%|██████████| 36/36 [00:00<00:00, 538.73it/s]
Defining INDIGO features: 100%|██████████| 57/57 [00:00<00:00, 552.91it/s]
Regression results:
        Spearman R = 0.5348
        Spearman p = 0.000779
        R2 = 0.122
Classification results:
              precision    recall  f1-score   support

           A       0.62      0.31      0.42        16
           N       0.00      0.00      0.00         1
           S       0.73      0.84      0.78        19

    accuracy                           0.58        36
   macro avg       0.45      0.38      0.40        36
weighted avg       0.66      0.58      0.60        36

Clinical results:
        Spearman R = 0.5035
        Spearman p = 6.55e-05
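A note on the .to_numpy().transpose() calls used throughout: the transpose suggests that feature_df comes back with features as rows and interactions as columns, while scikit-learn expects samples (here, interactions) as rows. A toy illustration of that reshaping — the DataFrame below is made up, not actual featurize output:

```python
import pandas as pd

# Made-up stand-in for featurize()['feature_df']: features as rows and
# interactions as columns, so scikit-learn needs the transpose.
feature_df = pd.DataFrame(
    {'ixn1': [0.1, 0.9], 'ixn2': [0.3, 0.2], 'ixn3': [0.7, 0.5]},
    index=['feat_a', 'feat_b'],
)
X = feature_df.to_numpy().transpose()
print(X.shape)  # (3, 2): 3 interactions (samples) by 2 features
```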

Example: S. aureus

The following analysis and results were originally reported in the INDIGO paper.
- Training dataset: 171 two-way interactions between 19 antibiotics measured in E. coli
- Testing dataset: 45 two-way interactions between 10 antibiotics measured in S. aureus
[4]:
# Load sample data
sample = load_sample('saureus')

# Define input arguments
key             = sample['key']
profiles        = sample['profiles']
feature_names   = sample['feature_names']
train_ixns      = sample['train']['interactions']
train_scores    = sample['train']['scores']
test_ixns       = sample['test']['interactions']
test_scores     = sample['test']['scores']
strains         = sample['orthology']['strains']
orthology_map   = sample['orthology']['map']

# Determine ML features
train_data      = featurize(train_ixns, profiles, feature_names=feature_names, key=key, silent=True)
test_data       = featurize(test_ixns, profiles, feature_names=feature_names, key=key, silent=True,
                            strains=strains, orthology_map=orthology_map)
X_train, X_test = train_data['feature_df'].to_numpy().transpose(), test_data['feature_df'].to_numpy().transpose()

# Determine class labels
thresh, classes = (-0.5, 2), ('S', 'N', 'A')
train_labels    = classify(train_scores, thresholds=thresh, classes=classes)
test_labels     = classify(test_scores, thresholds=thresh, classes=classes)

# Train and apply a regression-based model
reg_model = RandomForestRegressor()
reg_model.fit(X_train, train_scores)
reg_y = reg_model.predict(X_test)
r, p = spearmanr(test_scores, reg_y)
r2 = r2_score(test_scores, reg_y)
print('Regression results:')
print('\tSpearman R = {}'.format(round(r, 4)))
print('\tSpearman p = {:.3g}'.format(p))
print('\tR2 = {}'.format(round(r2, 4)))

# Train and apply a classification-based model
class_model = RandomForestClassifier()
class_model.fit(X_train, train_labels)
class_y = class_model.predict(X_test)
print('Classification results:')
print(classification_report(test_labels, class_y))
Defining INDIGO features: 100%|██████████| 171/171 [00:00<00:00, 508.65it/s]
Defining INDIGO features: 100%|██████████| 45/45 [00:00<00:00, 569.02it/s]
Mapping orthologous genes: 100%|██████████| 1/1 [00:01<00:00,  1.99s/it]
Regression results:
        Spearman R = 0.5428
        Spearman p = 0.000117
        R2 = -1.257
Classification results:
              precision    recall  f1-score   support

           A       0.00      0.00      0.00         2
           N       0.49      1.00      0.66        22
           S       0.00      0.00      0.00        21

    accuracy                           0.49        45
   macro avg       0.16      0.33      0.22        45
weighted avg       0.24      0.49      0.32        45

c:\Users\carol\AppData\Local\Programs\PythonCodingPack\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
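The UndefinedMetricWarning above appears because the model never predicts the 'S' or 'A' classes, leaving their precision undefined. scikit-learn's zero_division parameter controls this behavior; a minimal illustration with toy labels (not the sample data):

```python
from sklearn.metrics import classification_report

# Toy labels (not the sample data): the model never predicts 'S' or 'A',
# so their precision is undefined; zero_division=0 reports 0.0 without warning.
y_true = ['S', 'N', 'N', 'A']
y_pred = ['N', 'N', 'N', 'N']
print(classification_report(y_true, y_pred, zero_division=0))
```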

Example: A. baumannii

The following analysis and results were originally reported in the MAGENTA paper.
- Training dataset: 338 two- to three-way interactions between 24 antibiotics, measured in E. coli cultured under various media conditions
- Testing dataset: 45 two-way interactions between 8 antibiotics measured in A. baumannii
[5]:
# Load sample data
sample = load_sample('abaumannii')

# Define input arguments
key             = sample['key']
profiles        = sample['profiles']
feature_names   = sample['feature_names']
train_ixns      = sample['train']['interactions']
train_scores    = sample['train']['scores']
test_ixns       = sample['test']['interactions']
test_scores     = sample['test']['scores']
strains         = sample['orthology']['strains']
orthology_map   = sample['orthology']['map']

# Determine ML features
train_data      = featurize(train_ixns, profiles, feature_names=feature_names, key=key, silent=True)
test_data       = featurize(test_ixns, profiles, feature_names=feature_names, key=key, silent=True,
                            strains=strains, orthology_map=orthology_map)
X_train, X_test = train_data['feature_df'].to_numpy().transpose(), test_data['feature_df'].to_numpy().transpose()

# Determine class labels
thresh, classes = (-0.5, 0), ('S', 'N', 'A')
train_labels    = classify(train_scores, thresholds=thresh, classes=classes)
test_labels     = classify(test_scores, thresholds=thresh, classes=classes)

# Train and apply a regression-based model
reg_model = RandomForestRegressor()
reg_model.fit(X_train, train_scores)
reg_y = reg_model.predict(X_test)
r, p = spearmanr(test_scores, reg_y)
r2 = r2_score(test_scores, reg_y)
print('Regression results:')
print('\tSpearman R = {}'.format(round(r, 4)))
print('\tSpearman p = {:.3g}'.format(p))
print('\tR2 = {}'.format(round(r2, 4)))

# Train and apply a classification-based model
class_model = RandomForestClassifier()
class_model.fit(X_train, train_labels)
class_y = class_model.predict(X_test)
print('Classification results:')
print(classification_report(test_labels, class_y))

# Visualize results
df = pd.DataFrame({'x': test_labels, 'y': reg_y})
df.replace({'A': 'Antagonism', 'N': 'Neutral', 'S': 'Synergy'}, inplace=True)
sns.set(rc={'figure.figsize':(3, 4)})
ax = sns.boxplot(x='x', y='y', data=df, order=['Antagonism', 'Neutral', 'Synergy'])
ax.set(title='A. baumannii', xlabel='True class', ylabel='Predicted score')
plt.show()
Defining INDIGO features: 100%|██████████| 338/338 [00:00<00:00, 497.36it/s]
Defining INDIGO features: 100%|██████████| 45/45 [00:00<00:00, 589.56it/s]
Mapping orthologous genes: 100%|██████████| 1/1 [00:02<00:00,  2.58s/it]
Regression results:
        Spearman R = 0.614
        Spearman p = 7.28e-06
        R2 = -0.3302
Classification results:
              precision    recall  f1-score   support

           A       0.43      0.94      0.59        17
           N       0.00      0.00      0.00        11
           S       0.86      0.35      0.50        17

    accuracy                           0.49        45
   macro avg       0.43      0.43      0.36        45
weighted avg       0.49      0.49      0.41        45

[Figure: boxplot of predicted interaction scores grouped by true class (Antagonism, Neutral, Synergy) for A. baumannii]