Data Management

class menten_gcn.DataHolder(dtype: numpy.dtype = <class 'numpy.float32'>)[source]

DataHolder is a wonderful class that automatically stores the direct output of the DataMaker. The DataHolder can then feed your data directly into Keras's model.fit() method using the generators below.

There are descriptions for each method below but perhaps the best way to grasp the DataHolder’s usage is to see the example at the bottom.

Parameters

dtype (np.dtype) – What NumPy dtype should we use to represent your data?

append(X: numpy.ndarray, A: numpy.ndarray, E: numpy.ndarray, out: numpy.ndarray)[source]

This is the most important method in this class: it hands your data to the DataHolder. See the shape sketch after the parameter list.

Parameters
  • X (array-like) – Node features, shape=(N,F)

  • A (array-like) – Adjacency Matrix, shape=(N,N)

  • E (array-like) – Edge features, shape=(N,N,S)

  • out (array-like) – What is the output of your model supposed to be? You decide the shape.
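
A minimal sketch of the expected shapes, using dummy tensors (N, F, and S below are made-up sizes; in practice the DataMaker generates X, A, and E for you):

import numpy as np
import menten_gcn

N, F, S = 20, 5, 3  # made-up sizes: N nodes, F node features, S edge features

dataholder = menten_gcn.DataHolder()
X = np.zeros( shape=(N, F) )     # node features
A = np.zeros( shape=(N, N) )     # adjacency matrix
E = np.zeros( shape=(N, N, S) )  # edge features
out = np.asarray( [ 1.0 ] )      # whatever your model should predict for this graph
dataholder.append( X=X, A=A, E=E, out=out )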

assert_mode(mode=4)[source]

For those of you using spektral, this ensures that your data is in the correct shape. Unfortunately, this currently only checks X and A; more development is incoming.

save_to_file(fileprefix: str)[source]

Want to save this data for later? Use this method to cache it to disk.

Users of this method may be interested in the CachedDataHolderInputGenerator below.

Parameters

fileprefix (str) – Filename prefix for cache. fileprefix="foo/bar" will result in creating "./foo/bar.npz"

load_from_file(fileprefix: Optional[str] = None, filename: Optional[str] = None)[source]

save_to_file's partner. Use this to load caches that were already saved. Please provide either fileprefix or filename, but not both.

This duplication may seem silly. The goal of fileprefix is to be consistent with save_to_file (the same "fileprefix" string works for both calls), whereas the goal of filename is to simply name the file verbatim. See the short example after the parameter list.

Parameters
  • fileprefix (str) – Filename prefix for cache. fileprefix="foo/bar" will result in reading "./foo/bar.npz"

  • filename (str) – Filename for cache. filename="foo/bar.npz" will result in reading "./foo/bar.npz"
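
For example, assuming a cache was previously written with save_to_file( "foo/bar" ), either call below reads the same file:

dataholder = menten_gcn.DataHolder()
dataholder.load_from_file( fileprefix="foo/bar" )    # reads ./foo/bar.npz
# or, equivalently, name the file verbatim:
#dataholder.load_from_file( filename="foo/bar.npz" ) # reads ./foo/bar.npz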

Example:

def get_data_from_poses( pose_filenames, data_maker: menten_gcn.DataMaker ):

    """
    This function will load in many poses from disk and store their GCN tensors.
    Some parts of this will look different for you.
    In this case, we are making N graphs per pose where N is the number of residues in that pose.
    Each residue is the center of one graph.
    This is just an example, yours may look different.
    The point is that we are making many X, A, E, and out tensors and storing them in the DataHolder.
    """


    dataholder = menten_gcn.DataHolder()
    for filename in pose_filenames:
        pose = pose_from_pdb( filename ) #Rosetta
        wrapped_pose = RosettaPoseWrapper( pose=pose )
        for resid in range( 1, pose.size() + 1 ):
            X, A, E, meta = data_maker.generate_input_for_resid( wrapped_pose, resid )
            out = foo() #what should the output of the network be?
            dataholder.append( X=X, A=A, E=E, out=out )

    #optionally create an npz
    dataholder.save_to_file( "gcn_data" ) #creates gcn_data.npz

    return dataholder

See below for a full example

class menten_gcn.DecoratorDataCache(wrapped_pose: menten_gcn.wrappers.WrappedPose)[source]

DecoratorDataCache prevents re-calculating the same node/edge data many times. You will need to create a different cache for each pose you work with.

Also, we highly recommend creating this via the DataMaker (by calling data_maker.make_data_cache() ). This allows for further caching and speedups.

Parameters

wrapped_pose (WrappedPose) – Please pass the pose that we should make a cache for

Example:

def get_data_from_poses( pose_filenames, data_maker: menten_gcn.DataMaker ):
    """
    This function will load in many poses from disk and store their GCN tensors.
    Some parts of this will look different for you.
    In this case, we are making N graphs per pose where N is the number of residues in that pose.
    Each residue is the center of one graph.
    This is just an example, yours may look different.
    The point is that we are making many X, A, E, and out tensors and storing them in the DataHolder.
    """

    dataholder = menten_gcn.DataHolder()
    for filename in pose_filenames:
        pose = pose_from_pdb( filename ) #Rosetta
        wrapped_pose = RosettaPoseWrapper( pose=pose )
        cache = data_maker.make_data_cache( wrapped_pose )

        for resid in range( 1, pose.size() + 1 ):
            X, A, E, meta = data_maker.generate_input_for_resid( wrapped_pose, resid, data_cache = cache )
            out = foo() #what should the output of the network be?
            dataholder.append( X=X, A=A, E=E, out=out )

    #optionally create an npz
    dataholder.save_to_file( "gcn_data" ) #creates gcn_data.npz

    return dataholder

See below for a full example

class menten_gcn.DataHolderInputGenerator(data_holder: menten_gcn.data_management.DataHolder, batch_size: int = 32)[source]

This class is used to feed a DataHolder directly into Keras’s model.fit() protocol. See the example code below.

Parameters
  • data_holder (DataHolder) – A DataHolder that you just made

  • batch_size (int) – How many elements should be grouped together in batches during training?

Example:

#Setup: (See above for get_data_from_poses)
data_maker = make_datamaker() #Hand-wavy
train_poses = [ "A.pdb", "B.pdb", "C.pdb" ]
train_dataholder = get_data_from_poses( train_poses, data_maker )
val_poses = [ "D.pdb", "E.pdb" ]
val_dataholder = get_data_from_poses( val_poses, data_maker )

#Important Part:
train_generator = DataHolderInputGenerator( train_dataholder )
val_generator = DataHolderInputGenerator( val_dataholder )
model.fit( train_generator, validation_data=val_generator, epochs=100 )

class menten_gcn.CachedDataHolderInputGenerator(data_list_lines: List[str], cache: bool = False, batch_size: int = 32, autoshuffle: Optional[bool] = None)[source]

This class is used to feed a DataHolder directly into Keras’s model.fit() protocol.

The difference is that this class reads one or more DataHolders that have been saved to disk.

See the example code below.

Parameters
  • data_list_lines (list) – A list of filenames, each one for a different DataHolder.

  • cache (bool) – If true, this class will load every DataHolder into memory once and keep them there. This can require a lot of memory. Otherwise, we will only read in one DataHolder at a time (once per epoch). This increases disk IO but is often worth it.

  • batch_size (int) – How many elements should be grouped together in batches during training?

  • autoshuffle (bool) – This is very nuanced, so we recommend keeping the default value of None (which lets us pick the appropriate behavior). Long story short: you do NOT want to pass shuffle=True to Keras's model.fit() when cache=False, because disk IO goes through the roof. To counter this, we handle shuffling internally in a way that minimizes disk IO. However, you DO want shuffle=True when cache=True, because everything is in memory anyway. I know this is confusing. Maybe this will be cleaner in the future.

Example:

training_generator = CachedDataHolderInputGenerator( training_data_filenames, cache=False, batch_size=64 )
validation_generator = CachedDataHolderInputGenerator( validation_data_filenames, cache=False, batch_size=64 )
model.fit( training_generator, validation_data=validation_generator, epochs=1000, shuffle=False )
# Note shuffle=False
# CachedDataHolderInputGenerator does all shuffling internally to minimize disk access

See below for a full example

Full Example

Let’s say we want to create a model that predicts the solvent accessible surface area of a residue, given the residue and its surroundings. A single pass of the network will only predict one residue (the focus residue) and will include up to 19 neighbor nodes.

We have tons of data (10000 pdb files, for example) local on disk:

>>> ls inputs/*
inputs/00001.pdb inputs/00002.pdb ... inputs/10000.pdb

Keep in mind that we're storing a single data point for every residue of these poses. So if the average pose has 150 residues, we will end up with 10000 * 150 = 1.5 million training points. That will take a lot of memory to hold, so we should group the poses into batches of, say, 50 poses each:

>>> ls inputs/* | shuf | split -dl 50 - list
>>> ls ./list*
list001 list002 ... list200
>>> # 200 makes sense right? 10000 / 50 = 200
>>> wc -l list001
50
>>> head list001
inputs/03863.pdb
inputs/00134.pdb
inputs/00953.pdb
inputs/02387.pdb
inputs/09452.pdb

We’re then going to feed each list into:

# Let's call this make_data.py

import pyrosetta
pyrosetta.init()

import menten_gcn
import menten_gcn.decorators as decs
from menten_gcn.wrappers import RosettaPoseWrapper

import numpy as np
import sys

def run( listname: str ):
    #listname is list001 or list002 or so on

    dataholder = menten_gcn.DataHolder()

    decorators = [ decs.StandardBBGeometry(), decs.Sequence() ]
    data_maker = menten_gcn.DataMaker( decorators=decorators, edge_distance_cutoff_A=10.0, max_residues=20 )
    data_maker.summary()

    sasa_calc = pyrosetta.rosetta.core.simple_metrics.per_residue_metrics.PerResidueSasaMetric()

    listfile = open( listname, "r" )
    for line in listfile:

        pose = pyrosetta.pose_from_pdb( line.rstrip() )
        wrapped_pose = RosettaPoseWrapper( pose )
        cache = data_maker.make_data_cache( wrapped_pose )

        sasa_results = sasa_calc.calculate( pose )

        for resid in range( 1, pose.size() + 1 ):
            X, A, E, meta = data_maker.generate_input_for_resid( wrapped_pose, resid, data_cache=cache )
            out = np.asarray( [ sasa_results[ resid ] ] )
            dataholder.append( X=X, A=A, E=E, out=out )

    dataholder.save_to_file( listname ) #creates list001.npz, for example

    # this is a good time for "del dataholder" and garbage collection

if __name__ == '__main__':
    assert len( sys.argv ) == 2, "Please pass the list file name as the one and only argument"
    listname = sys.argv[ 1 ]
    run( listname )
>>> ls ./list* | xargs -n1 python3 make_data.py
>>> # ^ pipe into "xargs -n1 -P N python3 make_data.py" instead to run this in parallel on N processes
>>> ls ./list*.npz
list001.npz list002.npz ... list200.npz

Okay, now we have all of our training data on disk. Let's train:

# train.py

from spektral.layers import *
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

import menten_gcn
import menten_gcn.decorators as decs

import numpy as np
import sys

def make_model():

    """
    This is just a simple model
    Model building is not the point of this example
    """

    # Be sure to use the same data_maker configuration as before
    # Otherwise the tensor sizes may not be the same
    decorators = [ decs.StandardBBGeometry(), decs.Sequence() ]
    data_maker = menten_gcn.DataMaker( decorators=decorators, edge_distance_cutoff_A=10.0, max_residues=20 )


    X_in, A_in, E_in = data_maker.generate_XAE_input_tensors()
    X1 = EdgeConditionedConv( 30, activation='relu' )([X_in, A_in, E_in])
    X2 = EdgeConditionedConv( 30, activation='relu' )([X1, A_in, E_in])
    FinalPool = GlobalSumPool()(X2)
    output = Dense( 1, name="out" )(FinalPool)

    model = Model(inputs=[X_in,A_in,E_in], outputs=output)
    model.compile(optimizer='adam', loss='mean_squared_error' )
    model.summary()

    return model

if __name__ == '__main__':
    assert len( sys.argv ) > 1, "Please pass the npz files as arguments"
    npznames = sys.argv[1:]

    # use 20% for validation
    fifth = int( len(npznames) / 5 )
    training_data_filenames = npznames[fifth:]
    validation_data_filenames = npznames[:fifth]

    training_generator = menten_gcn.CachedDataHolderInputGenerator( training_data_filenames, cache=False, batch_size=64 )
    validation_generator = menten_gcn.CachedDataHolderInputGenerator( validation_data_filenames, cache=False, batch_size=64, autoshuffle=False ) #Note autoshuffle=False is recommended for validation data

    model = make_model()
    model.fit( training_generator, validation_data=validation_generator, epochs=1000, shuffle=False )
    model.save( "my_model.h5" )
>>> python3 train.py ./list*.npz
>>> ls *.h5
my_model.h5
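
Once training is done, the same DataMaker configuration can be reused to generate inputs for prediction. Below is a rough, hypothetical sketch (the pose filename is made up, and loading a model that contains spektral layers may require passing those layer classes via custom_objects):

# predict.py (hypothetical sketch)

import pyrosetta
pyrosetta.init()

import menten_gcn
import menten_gcn.decorators as decs
from menten_gcn.wrappers import RosettaPoseWrapper

from spektral.layers import EdgeConditionedConv, GlobalSumPool
from tensorflow.keras.models import load_model

import numpy as np

# Be sure to use the same data_maker configuration as during training
decorators = [ decs.StandardBBGeometry(), decs.Sequence() ]
data_maker = menten_gcn.DataMaker( decorators=decorators, edge_distance_cutoff_A=10.0, max_residues=20 )

model = load_model( "my_model.h5",
                    custom_objects={ "EdgeConditionedConv": EdgeConditionedConv,
                                     "GlobalSumPool": GlobalSumPool } )

pose = pyrosetta.pose_from_pdb( "new_structure.pdb" ) # made-up filename
wrapped_pose = RosettaPoseWrapper( pose )
cache = data_maker.make_data_cache( wrapped_pose )

for resid in range( 1, pose.size() + 1 ):
    X, A, E, meta = data_maker.generate_input_for_resid( wrapped_pose, resid, data_cache=cache )
    # add a batch dimension of 1 before calling predict
    pred = model.predict( [ np.asarray([X]), np.asarray([A]), np.asarray([E]) ] )
    print( resid, pred[0][0] )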

Okay, we're done! So why did we go through all that effort of caching on disk?

Your mileage may vary, but I find that I end up with more data than can fit in my system’s memory. It’s actually reasonably fast to just keep all of the data on disk and read it in each epoch, especially for you SSD users.

We were able to train this entire model with no more than two DataHolders loaded into memory at any given time. Given that we split our data into 200 DataHolders, this is roughly a 100x decrease in memory usage!