***** Guide ***** Model Interface =============== A MISTK model implements the `mistk.model.AbstractModel` class. This class provides a set of abstract methods that represent the model lifecycle and must be implemented by the new model. Ultimately, these methods form the core of the web endpoint service that is made available for every model implementation. **do_initialize** (objectives: list, props: dict, hparams: dict) Called once the endpoint service has launched. This would typically be the first call made to the service. Perform any general setup and initialization. :param objectives: A list of model objectives intended to aid in its setup and initialization. Possible values are: 'training', 'prediction', 'streaming_prediction', 'generation', 'transfer_learning'. :param props: A dictionary of settings or configuration values that are passed from the ecosystem, but are not considered model hyperparameters. :param hparams: A dictionary of hyperparameters that are used by the model. **do_load_data** (dataset_map: dict) Instructs the container to load training, testing, or generation data (or at least record in memory where the data is) from the supplied paths. :param dataset_map: A dictionary that maps string keys {train, test, generation} to a MistkDataset object that contains information on the dataset to load. The `objectives` values for this model determines which keys are present in the map (i.e. a model with only a training objective should not depend on a 'test' key to be present). The MistkDataset values will require the following fields: object_info: A dictionary that stores metadata information for the dataset. Only the 'name' and 'kind' fields are required, all other fields are optional. data_path: A string containing the path to the dataset root folder modality: A string with value image, audio, video, or text format: A string containing the name of the format of this dataset **do_build_model** (path=None) Instructs the service to build all necessary data structures given the architecture and selected hyperparameters. :param path: The path to the model file or checkpoint that should be loaded. Defaults to None if no file was specified in this model's definition. **do_train** () Perform training with the previously supplied data. **do_predict** () Perform predictions with the previously supplied data. **do_generate** () Perform generations with the previously supplied data. **do_pause** () Pause the current training or testing. **do_resume_training** () Resume the previously paused training. **do_resume_predict** () Resume the previously paused predictions. **do_save_predictions** (dataPath) Save the model predictions to the supplied data path. The predictions should be saved as CSV with each row in the following format: id,label[,confidence][,bounds] :param dataPath: The path to which the predictions should be saved. **do_save_generations** (dataPath) Saves the generated media produced by the model to the supplied data path. :param dataPath: The path to which the generations should be saved. **do_stream_predict** (data_map: dict, details: bool) Perform predictions on the input dict of id's to base64 encoded data and return a dict of id's to predicted values. The underlying format of the base64 encoded data should be the native input format for the model. :param data_map: A dict of id's to base64 encoded data. :param details: Optional parameter for the model to provide additional details in the returned dict. Default value is False. :return: A dict of id's to predicted values. The id 'details' is a keyword used for additional details provided in Markdown or HTML. **do_update_stream_properties** (props: dict) Updates the stream prediction properties for subsequent stream predict calls. :param dict: A dict that can be used for streaming properties. **do_save_model** (path) Save a checkpoint of the model to the supplied data path. Format of the saved file(s) is at the discretion of the model. The infrastructure associates this checkpoint with this model. :param path: The path to which the model should be saved. **do_terminate** () Prepare for application termination. The Model State Machine ----------------------- Model implementations follow a workflow lifecycle based on state machine transitions. The underlying MISTK infrastructure ensures that only legal transitions from one state to another can be made and that the appropriate model methods are called during those transitions. The image below illustrates the high-level state machine transitions and the model methods that may be called between them. .. image:: images/state_machine.png The 'Terminated' state (and associated method 'do_terminate') is not pictured but is a valid transition from any of the states depicted above. Note that an internal 'Failed' state will be entered if the model implementation throws an exception. No further model activities are permitted from the 'Failed' state. If a model workflow enters the 'Failed' state, then the workflow will need to be restarted from the beginning. This is required in order to support various machine learning technologies (ie. Tensorflow) that may not have a methodology to reset/resume their state after a critical failure. Initializing a Model -------------------- Initialization of a model loads the hyperparameters and model properties associated with the model. Hyperparameters are defined by the algorithm that a model is based on. Model properties are defined by a specific implementation of an algorithm (ie. a PyTorch implementation of Densenet may have different properties than a Tensorflow implementation). Below are sample hyperparameters and model properties dictionaries that can be passed to a model implementation's 'do_initialize' method. Hyperparameters ::::::::::::::: { 'learning_rate': 0.001, 'weight_decay': 0.004 'momentum': 0.9 } Model Properties :::::::::::::::: { 'arch': 'densenet', 'num_labels': 1000, 'model_load_file': 'checkpoint.pth' } Loading Data ------------ When loading data into a model instance using the **do_load_data(dataset_map)** method inherited from 'AbstractModel', the dataset_map parameter must include sufficient information for each dataset mapped to an objective {train, test, generate}. The required fields include 'object_info', 'data_path', 'modality', and 'format' (Described in the Model API method documentation previously) An example dataset map is provided below: :: { "train": { "object_info": { "name": "image_training_set", "kind": "MistkDataset" }, "data_path": "/path/to/data", "modality": "image", "format": "jpg" }, "test": { "object_info": { "name": "image_testing_set", "kind": "MistkDataset" }, "data_path": "/path/to/data", "modality": "image", "format": "jpg" }, "generation": { "object_info": { "name": "image_generation_set", "kind": "MistkDataset" }, "data_path": "/path/to/data", "modality": "text", "format": "raw" } } The dataset for generations should contain files that indicate the targets for generation. Please note, that when using the RESTful API directly, where the user would be submitting the JSON via the 'load_data' API call, the JSON keys would need to be in camelcase format rather than underscored (ie. object_info becomes objectInfo). Model Status ------------ Model implementations are encouraged to report status pertaining to their current workflow state back to the MISTK infrastructure by calling the **update_status(dict)** method inherited from `AbstractModel` (note this method is intended to be called by the model, not overridden as the methods above). This method takes a dictionary object of key-value pairs defined by the model. For example, the logistic regression model below updates its status with the number of samples fit during training and the number of samples predicted during testing. Predictions and Ground Truth ---------------------------- Model prediction output from the **do_save_predictions** method as well as corresponding dataset ground truth should be CSV-formatted as follows:: id,label[,confidence][,bounds] The `id` field can be arbitrary as long as predictions and ground truth use the same values. All columns except `id` may contain whitespace-separated values as necessary. The confidence probabilities and bounds values are optional. * Example Prediction CSV line for only id and label: mnist-test-00001,7 * Example Prediction CSV line for id and a single label, confidence, and bounds: ILSVRC2012_val_00000163,n03733805,.9,84 0 433 278 * Example Prediction CSV line for id and a single label, no confidence, and bounds: ILSVRC2012_val_00000163,n03733805,,84 0 433 278 * Example Prediction CSV line for id and multiple (3) label, confidence, and bounds: ILSVRC2012_val_00004833,n03733805 n03733805 n03733805,.9 .8 .8,85 93 350 355 301 306 421 428 0 5 469 373 * Example of *incorrect* Prediction CSV line where addition bounds and confidence will be ignored due to only 1 label: ILSVRC2012_val_00004833,n03733805,.9 .8 .8,85 93 350 355 301 306 421 428 0 5 469 373 Example Model ------- The following example code implements a logistic regression algorithm from scikit-learn with the MISTK model interface which operates on a common public dataset: .. literalinclude:: sample_code_files/logistic_regression.py :linenos: :language: python Transfer learning ----------------- Model support of transfer learning is independent to each model implementation. Model developers that use the MISTK library need to incorporate their implementation features for transfer learning into their build_model and train methods. The PyTorch implementation below builds their model with specific configurations when performing transfer learning. The highlighted lines are relevant lines for transfer learning in this PyTorch implementation. .. literalinclude:: sample_code_files/pytorch_model.py :emphasize-lines: 62-69, 79-82 :linenos: :language: python Model developers will need to specify in their model properties that they are performing transfer learning when using this model for that purpose. Data Transformation Interface ============================= The MISTK API provides an interface for creating data transformation plugins. These plugins run within docker containers similar to models. A MISTK data transform implements the 'mistk.transform.AbstractTransformPlugin' abstract class. This class provides a set of abstract methods that represent the data transform lifecycle and must be implemented by the new data transform plugin. Ultimately, these methods form the core of the web endpoint service that is made available for every data transform implementation. **do_transform** (inputDirs: list, outputDir: string, properties: dict) Performs a data transformation using the data provided in the input directories and stores the resulting dataset(s) in the output directory. :param inputDirs: A list of directory paths from which to load input datasets from. :param outputDir: The directory path where the output dataset will be stored :param props: A dictionary of settings or configuration values that are passed to the transformation. **do_terminate** () Prepare for application termination. The Transformation State Machine -------------------------------- Data transformation plugin implementations follow a workflow lifecycle based on state machine transitions. The underlying MISTK infrastructure ensures that only legal transitions from one state to another can be made and that the appropriate model methods are called during those transitions. The image below illustrates the high-level state machine transitions and the data transform plugin methods that may be called between them. .. image:: images/transform_state_machine.png A data transform plugin instance will automatically go to the Ready state if its container instance successfully starts up. After completing a transformation, it will return to the Ready state so that multiple transformations can be executed using the same container instance. Example ------- The following example code implements a ground truth transform for an MNIST formatted dataset with the MISTK data transformation interfacet: .. literalinclude:: sample_code_files/groundtruth_mnist_plugin.py :linenos: :language: python Evaluation Interface ==================== The MISTK API provides an interface for creating evaluation plugins to evaluate models against specific metrics. These plugins run within docker containers similar to models. A MISTK evaluation implements the 'mistk.evaluation.AbstractEvaluationPlugin' abstract class. This class provides a set of abstract methods that represent the evaluation lifecycle and must be implemented by the new evaluation plugin. Ultimately, these methods form the core of the web endpoint service that is made available for every evaluation implementation. **do_evaluate** (assessment_type: string, metrics: list, input_data_path: string, evaluation_input_format: string, ground_truth_path: string, evaluation_path: string, properties: dict) Performs an evaluation using the ground truth data and predictions data for the assessment type and metrics specified and stores the resulting evaluation in the output directory. :param assessment_type: The evaluation type. One of {'BinaryClassification', 'MultilabelClassification', 'MulticlassClassification', 'Regression'} :param metrics: Specific metrics to evaluate against instead of all metrics defined by assessment_type :param input_data_path: Path to input data for the evaluation :param evaluation_input_format: The format of the input data. One of {predictions, generations} :param ground_truth_path: The directory path where the ground_truth.csv file is located :param evaluation_path: A directory path to where the evaluation.json output file will be stored :param properties: A dictionary of key value pairs for evaluation plugin arguments. **do_terminate** () Prepare for application termination. Evaluation Metrics ------------------ Metrics are loaded for the evaluation container based on the metrics.json file. This JSON file contains a list of Metric objects to describe the metric and how it will be executed. The Metric values will require the following fields: object_info: A dictionary that stores metadata information for the dataset. Only the 'name' and 'kind' fields are required, all other fields are optional. image_id: The image id for the evaluation container to run. assessment_types: The assessment type for the metric, e.g. MulticlassClassification. An example for a metric: :: { "objectInfo": { "kind": "Metric", "name": "sklearn.metrics.r2_score" }, "imageId": "docker-registry:5000/sml-evaluators/sklearn", "assessmentTypes": ["Regression"] } The Metric values can optionally contain the following fields for accessing external metrics from other Python packages: package: (str) The name of the package containing the implementation of this external metric. method: (str) The name of the method from `package` to be called when executing the external metric. defaultArgs: (dict) The default arguments passed to the method when the metric is called. These key/value pairs can also be used as properties for the metric. An example for an external metric: :: { "objectInfo": { "kind": "Metric", "name": "r2_score" }, "imageId": "docker-registry:5000/sml-evaluators/sklearn", "assessmentTypes": ["Regression"], "package": "sklearn.metrics", "method": "r2_score", "defaultArgs": { "multioutput": "uniform_average" } } The Evaluation State Machine ---------------------------- Evaluation plugin implementations follow a workflow lifecycle based on state machine transitions. The underlying MISTK infrastructure ensures that only legal transitions from one state to another can be made and that the appropriate model methods are called during those transitions. The image below illustrates the high-level state machine transitions and the evaluation plugin methods that may be called between them. .. image:: images/evaluation_state_machine.png A evaluation plugin instance will automatically go to the Started state if its container instance successfully starts up. After completing a evaluation, it will return to the Ready state so that multiple evaluations can be executed using the same container instance. Example ------- The following example code implements a sklearn evaluation metrics with the MISTK evaluation metrics interfacet: .. literalinclude:: sample_code_files/sklearn_evaluation.py :linenos: :language: python