

End-to-end reproducible Machine Learning pipelines on Kubernetes

This is Part 3, End-to-end reproducible Machine Learning pipelines on Kubernetes, of the technical blog series [Reproducibility in Machine Learning]. Part 1 and Part 2 can be found here and here respectively.


The Change Anything Changes Everything (CAKE) principle ([Sculley et al.][scully_2015]) is real in ML. Also, 100% reproducible ML code comes at the cost of speed, a non-negotiable aspect these days. If we cannot be 100% reproducible and change is inevitable, then the only way to maintain explainability, understanding, trust, and confidence is to version control everything.

Figure 1: Version control explained by XKCD

In this post, we will be looking at building an end-to-end, fully automated ML pipeline that maintains full provenance across the entire ML system. In my view, a standard machine learning workflow looks like the one below:

Figure 2: Machine Learning end to end system

So we will be working towards building this system with full provenance over it. For this, we will be extending our sample semantic segmentation example based on the Oxford Pet dataset.

To build this ML workflow, we will be using Kubernetes, a container orchestration platform. On top of Kubernetes, we will be running Pachyderm, which will do the heavy lifting of maintaining provenance across data, artifacts, and ML processes.

What to version control

In part 1 of this blog series, we discussed the challenges, shown in figure 3, in realizing reproducible ML.

Figure 3: Overview of challenges in reproducible ML

The presence of these challenges in the system-wide view of ML is shown in figure 4.

Figure 4: What to version control?

But first, let's talk about creating the environment and infrastructure, and versioning them.

1. Versioning environment

Using GitOps, the environment and any changes associated with it can be version controlled. In this sample, we will be using ArgoCD to implement a GitOps workflow (figure 5), which moves our environment config alongside the code repository (figure 6).

Figure 5: Gitops

This is achieved by defining Argo apps, which can be applied to a bring-your-own (BYO) Kubernetes cluster (version 1.14.7) that has ArgoCD installed:

kubectl apply -f https://raw.githubusercontent.com/suneeta-mall/e2e-ml-on-k8s/master/cluster-conf/e2e-ml-argocd-app.yaml

Figure 6: Gitops on environment config

Once the Argo apps are created, the following software will be installed on the cluster:

- Kubernetes: 1.14.7 (tested on this version; in theory, it should work with other versions too!)
- ArgoCD: 1.2.3
- Kubeflow: 0.6.2. Kubeflow is an ML toolkit designed to bring a variety of ML-related, Kubernetes-based software together.
- Seldon: 0.4.1 (upgraded from the version packaged with Kubeflow 0.6.2). Seldon is model-serving software.
- Pachyderm: 1.9.7. Pachyderm offers a git-like repository that can hold data, even big data. It also offers automated repositories that act on inputs and generate outputs, keeping a versioned copy of the generated data. Together, these constructs can be used to create DAG-like pipelines with provenance across each step's input, transformation spec, and output.

Any change to this configuration repository will then trigger a cluster update, keeping the environment in sync with the versioned config.

2. Versioning data, process, and artifacts

Figure 7: Artifact view of Machine Learning end to end system (shown in figure 2)

The Pachyderm pipeline specification for the end-to-end ML workflow shown in figure 2 is available here. The artifacts/data generated by this ML workflow are shown in figure 7 above, along with their associations with the processes that produce and consume them.

---
pipeline:
  name: stream
transform:
  image: suneetamall/pykubectl:1.14.7
  cmd:
  - "/bin/bash"
  stdin:
  - "wget -O images.tar.gz https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz && \
     wget -O annotations.tar.gz https://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz && \
     tar -cvf data.tar.gz *.tar.gz && \
     cat data.tar.gz > /pfs/out && \
     while :; do sleep 2073600; done"
spout:
  overwrite: true
---
input:
  pfs:
    glob: /
    repo: stream
pipeline:
  name: warehouse
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python download_petset.py --input /pfs/stream/ --output /pfs/out"
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: warehouse
pipeline:
  name: transform
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python dataset_gen.py --input /pfs/warehouse --output /pfs/out"
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: transform
pipeline:
  name: train
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python train.py --input /pfs/transform --output /pfs/out --checkpoint_path /pfs/out/ckpts --tensorboard_path /pfs/out"
resource_requests:
  memory: 2G
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: transform
pipeline:
  name: tune
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python tune.py --input /pfs/transform --output /pfs/out"
resource_requests:
  memory: 4G
  cpu: 1
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/"
       repo: transform
    - pfs:
        glob: "/optimal_hp.json"
        repo: tune
pipeline:
  name: model
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python train.py --input /pfs/transform --hyperparam_fn_path /pfs/tune/optimal_hp.json
     --output /pfs/out --checkpoint_path /pfs/out/ckpts --tensorboard_path /pfs/out"
  - "ln -s /pfs/tune/optimal_hp.json /pfs/out/optimal_hp.json"
resource_requests:
  memory: 2G
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/calibration"
       repo: transform
    - pfs:
        glob: "/model.h5"
        repo: model
pipeline:
  name: calibrate
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python calibrate.py --input /pfs/transform --model_weight /pfs/model/model.h5 --output /pfs/out"
  - "ln -s /pfs/model/model.h5 /pfs/out/model.h5"
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/test"
       repo: transform
    - pfs:
        glob: "/"
        repo: calibrate
pipeline:
  name: evaluate
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "papermill evaluate.ipynb /pfs/out/Report.ipynb \
      -p model_weights /pfs/calibrate/model.h5 \
      -p calibration_weights /pfs/calibrate/calibration.weights \
      -p input_data_dir /pfs/transform \
      -p out_dir /pfs/out \
      -p hyperparameters /pfs/calibrate/optimal_hp.json"
  - "ln -s /pfs/calibrate/model.h5 /pfs/out/model.h5"
  - "ln -s /pfs/calibrate/calibration.weights /pfs/out/calibration.weights"
resource_requests:
  memory: 1G
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: evaluate
pipeline:
  name: release
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python release.py --model_db evaluate --input /pfs/evaluate/evaluation_result.csv --version ${evaluate_COMMIT}"
pod_spec: '{"serviceAccount": "ml-user", "serviceAccountName": "ml-user"}'
datum_tries: 2
#standby: true
---
## Service https://docs.pachyderm.io/en/1/concepts/pipeline-concepts/pipeline/service.html
input:
  pfs:
    glob: "/"
    repo: model
pipeline:
  name: tensorboard
service:
  external_port: 30888
  internal_port: 6006
transform:
  cmd:
  - "/bin/bash"
  stdin:
  - tensorboard --logdir=/pfs/model/
  image: suneetamall/e2e-ml-on-k8s:1
---

This pipeline creates the ML workflow, with the artifact dependencies shown in figure 7 above, wherein full provenance across data, processes, and outcomes is maintained along with the respective lineage.


This is the last post of the technical blog series, [Reproducibility in Machine Learning].

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine-Learning.html

[scully_2015]: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

Replicability - an extension to reproducibility

This note, related to Part 1 (Reproducibility in Machine Learning - Research and Industry) of the technical blog series [Reproducibility in Machine Learning], looks at replicability.

The research community is quite divided when it comes to defining reproducibility, and it is often mixed up with replicability.

Figure 4: Replicability defined

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine-Learning.html

Realizing reproducible Machine Learning - with Tensorflow

This is Part 2, Realizing reproducible Machine Learning - with Tensorflow, of the technical blog series [Reproducibility in Machine Learning]. Part 1 and Part 3 can be found here and here respectively.


As discussed in Part 1, writing reproducible machine learning code is not easy, with challenges arising from every direction, e.g. hardware, software, algorithms, process and practice, and data. In this post, we will focus on what is needed to ensure ML code is reproducible.

First things first

There are a few very simple things that need to be done before thinking big and focussing on writing reproducible ML code. These are:

  • Code is version controlled

Same input (data) and same process (code) resulting in the same output is the essence of reproducibility. But code keeps evolving, since ML is so iterative. Hence, it's important to version control code (including configuration). This allows obtaining the exact code (commit/version) from the source repository.

  • Reproducible runtime – pinned libraries

So we have version-controlled the code, but what about the environment/runtime? Sometimes, non-determinism is introduced not by user code directly but by its dependencies. We talked about this at length in the software section of Challenges in realizing reproducible ML in Part 1. Take the example of Pyproj, a geospatial transform library that I once used to compute geo-locations based on some parameters. We changed nothing but the version of pyproj, from v1.9.6 to v2.4.0, and suddenly all our calculations were giving different results. The difference was so large that the computed location of the San Diego Convention Centre came out to be somewhere in Miramar, off a golf course (see figure 1, issue link). Now imagine ordering pizza delivery on the back of my computation snippet backed by an unpinned version of pyproj!

Figure 1: Example of why pinned libraries are important

Challenges like these occur more often than we would like. That's why it's important to pin/fix the runtime, either by pinning library versions or by using versioned containers (e.g. Docker).
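As a small illustration of the idea (the package names and versions below are hypothetical, and the pinning itself would normally live in a requirements.txt, a lock file, or a versioned Docker image), a runtime guard can fail fast if the environment drifts from the pinned versions:

import pkg_resources

# Hypothetical pins; in practice these mirror requirements.txt or the container image.
PINNED_VERSIONS = {"pyproj": "1.9.6", "tensorflow": "2.0.0", "numpy": "1.17.3"}

def assert_pinned_versions(pins=PINNED_VERSIONS):
    """Fail fast if the runtime does not match the pinned dependency versions."""
    for package, expected in pins.items():
        installed = pkg_resources.get_distribution(package).version
        if installed != expected:
            raise RuntimeError(f"{package}=={installed} found but {expected} is pinned; "
                               "results may not be reproducible")

assert_pinned_versions()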

  • Smart randomness

As discussed in Part 1, randomization plays a key role in most ML algorithms. Unseeded randomness is the simplest way to make code non-reproducible. It is also one of the easiest to manage amongst all the gotchas of non-reproducible ML. All we need to do is seed all the randomness and manage the seed via configuration (as code or external).

  • Rounding precision, under-flows & overflow

Floating-point arithmetic is ubiquitous in ML. The complexity and intensity of floating-point operations (FLOPs) are increasing every day, with current needs easily reaching giga-FLOP orders of computation. To achieve speed despite this complexity, mixed-precision floating-point operations have also been proposed. As discussed in Part 1, accelerated hardware such as general-purpose GPUs (GPGPU), tensor processing units (TPU), etc., due to their architecture and asynchronous computing, do not guarantee reproducibility. Besides, when dealing with floating points, issues related to overflow and underflow are to be expected. This just adds to the complexity.
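To make the floating-point concern concrete, here is a tiny self-contained sketch (not from the example repository) showing that the order of additions alone changes the result; parallel and asynchronous execution effectively reorders reductions like this, which is one source of run-to-run drift:

import numpy as np

# Floating-point addition is not associative, so the grouping/order of operations matters.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False: 0.6000000000000001 vs 0.6

# The same effect at scale: summing identical float32 values in two different orders
# typically gives totals that differ in the last digits.
rng = np.random.RandomState(0)
values = rng.random_sample(100000).astype(np.float32)
shuffled = values.copy()
rng.shuffle(shuffled)

total_ordered, total_shuffled = np.float32(0.0), np.float32(0.0)
for x in values:
    total_ordered += x
for x in shuffled:
    total_shuffled += x
print(total_ordered, total_shuffled, total_ordered == total_shuffled)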

  • Dependent library’s behavior aware

As discussed in the software section of Challenges in realizing reproducible ML in Part 1, some routines of ML libraries do not guarantee reproducibility. One instance is NVIDIA's CUDA-based deep learning library cuDNN. Similarly, with Tensorflow, using some methods may result in non-deterministic behavior; one such example is the backward pass of broadcasting on GPU (ref). Awareness of the behavior of the libraries being used, and of approaches to overcome their non-determinism, should be part of writing reproducible ML.

Writing reproducible ML

To demonstrate reproducible ML, I will be using the Oxford Pet dataset, which has labels for pet images. I will be doing semantic segmentation of pet images using pixel-level labels. Each pixel of a pet image in the Oxford Pet dataset is labeled as 1) foreground (pet area), 2) background (not pet area), or 3) unknown (edges). These labels are by definition mutually exclusive, i.e. a pixel can only be one of the above three.

Figure 2: Oxford pet dataset

I will be using a convolutional neural network (ConvNet) for semantic segmentation. The network architecture is based on U-net. This is similar to the standard semantic segmentation example by Tensorflow.

Figure 3: U-net architecture

The reproducible version of the semantic segmentation example is available in the Github repository. This example demonstrates reproducible ML and also performs end-to-end ML with provenance across processes and data.

Figure 4: Reproducible ML sample - semantic segmentation of oxford pet

In this post, however, I will be discussing only the reproducible ML aspect of it and will be referencing snippets of this example.

ML workflow

In reality, a machine learning workflow is very complex and looks somewhat like figure 5. In this post, however, we will only discuss the data and model training parts of it. The remaining, end-to-end workflow will be discussed in the next post.

Figure 5: Machine learning workflow

Data

The source dataset is the Oxford Pet dataset, which contains a multitude of labels, e.g. class outcomes, pixel-wise labels, bounding boxes, etc. The first step is to process this data to generate the trainable dataset. In the example code, this is done by the download_petset.py script:

python download_petset.py  --output /wks/petset
The resulting sample is shown in figure 6.

Figure 6: Pets data partitioned by Pets ID

After data partitioning, the entire dataset is divided into 4 sets: a) training, b) validation, c) calibration, and d) test. We would want this partitioning strategy to be reproducible. By doing this, we ensure that if we have to blow away the training dataset, or if accidental data loss occurs, then the exact same dataset can be re-created.

In this sample, this is achieved by hashing the pet id and reducing the hash modulo 10 (script below) to obtain the partition index of the pet id.

partition_idx = int(hashlib.md5((os.path.basename(petset_id)).encode()).hexdigest(), 16) % 10

With partition_idx 0-6 assigned to training, 7 to validation, 8 to calibration, and 9 to test, every regeneration will send each pet into the same partition, as sketched below.
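A fuller sketch of this mapping (the helper name and sample id are illustrative; download_petset.py and dataset_gen.py in the repository hold the actual logic):

import hashlib
import os

def partition_for(petset_id):
    """Deterministically map a pet id to one of the four dataset splits."""
    # md5 of the id reduced modulo 10: buckets 0-6 train, 7 validation, 8 calibration, 9 test.
    partition_idx = int(hashlib.md5(os.path.basename(petset_id).encode()).hexdigest(), 16) % 10
    if partition_idx <= 6:
        return "train"
    if partition_idx == 7:
        return "validation"
    if partition_idx == 8:
        return "calibration"
    return "test"

print(partition_for("Abyssinian_1"))  # the same id always lands in the same split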

Besides the set partitioning, any random data augmentation performed is seeded, with the seed controlled as configuration as code. See tf.image.random_flip_left_right used in this tensorflow data pipeline method.
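For example, a seeded flip applied consistently to an image and its segmentation mask could look like the sketch below (illustrative only; the repository's tf.data pipeline is the source of truth):

import tensorflow as tf

SEED = 42  # managed as configuration as code

def seeded_flip(image, mask, seed=SEED):
    # A single seeded coin toss drives both tensors, so the image/mask pair stays aligned
    # and the pseudo-random sequence of flips is reproducible from run to run.
    flip = tf.random.uniform((), seed=seed) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask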

The script for model dataset preparation is located in dataset_gen.py and is used as follows:

python dataset_gen.py --input /wks/petset --output /wks/model_dataset
with results shown below:

Figure 7: Pets data partitioned into training, validation, calibration, and test set

Modelling semantic segmentation

The model for pet segmentation is based on U-net with a backbone of either MobileNet-v2 or VGG-19 (defaulting to VGG-19). As per this model's network architecture, 5 activation layers of the pre-trained backbone network are chosen. These layers are:

  • MobileNet
    'block_1_expand_relu'
    'block_3_expand_relu'
    'block_6_expand_relu'
    'block_13_expand_relu'
    'block_16_project'
  • VGG
    'block1_pool'
    'block2_pool'
    'block3_pool'
    'block4_pool'
    'block5_pool'

Each of these layers is then concatenated with a corresponding upsampling layer comprising a Conv2DTranspose layer, forming what's known as a skip connection. See the model code for more info.
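To make the skip-connection construction concrete, here is a minimal Keras-style sketch (filter counts, initializer parameters, and wiring are illustrative; the repository's model code is the actual implementation):

import tensorflow as tf
from tensorflow.keras import layers

def upsample_block(x, skip, filters, seed=42):
    # Decoder step: upsample with a seeded Conv2DTranspose, then concatenate the matching
    # encoder activation; the concatenation is the skip connection.
    init = tf.random_normal_initializer(0.0, 0.02, seed=seed)
    x = layers.Conv2DTranspose(filters, kernel_size=3, strides=2, padding="same",
                               kernel_initializer=init)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Concatenate()([x, skip])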

The training script train.py can be used as follows:

python train.py --input /wks/model_dataset --output /wks/model --checkpoint_path /wks/model/ckpts --tensorboard_path /wks/model
1. Seeding randomness

All methods exploiting randomness are used with an appropriate seed:
- All random initializations are seeded, e.g. tf.random_normal_initializer
- All dropout layers are seeded, e.g. dropout

A global seed is set for any hidden methods that may be using randomness by calling set_seeds(seed), which sets the seed for the libraries used:

def set_seeds(seed=SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
2. Handling library behaviors
2.1 CuDNN

CuDNN does not guarantee reproducibility in some of its routines. The environment variables TF_DETERMINISTIC_OPS and TF_CUDNN_DETERMINISTIC can be used to control this behavior, as per this snippet (figure 8) from the cuDNN release page.

Figure 8: NVIDIA release page snippet reproducibility
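In code, this amounts to exporting the two variables before TensorFlow executes any operations, which is exactly what the set_global_determinism helper further below does:

import os

# Must be set before any (GPU) op runs.
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'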

2.2 CPU thread parallelism

As discussed in Part 1, using high parallelism with compute-intensive workflows may not be reproducible. Configurations for inter-op and intra-op parallelism (ref) should be set if 100% reproducibility is desired.

In this example, I have chosen 1 to avoid any non-determinism arising from inter-operation parallelism. Warning: setting this will considerably slow down training.

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
2.3 Tensorflow determinism

The following are some examples of Tensorflow not being reproducible:
- The backward pass of broadcasting on GPU is non-deterministic (link)
- GPU reductions are nondeterministic, as mentioned in the docs (link)
- Problems getting TensorFlow to behave deterministically (link)

Duncan Riach, along with several other contributors, has created the tensorflow_determinism package, which can be used to overcome non-reproducibility related challenges in TensorFlow. It should be used in addition to the measures we have discussed so far.

If we combine all the approaches discussed above, together with seeded randomness, they can be wrapped into a lightweight method like the one below:

def set_global_determinism(seed=SEED, fast_n_close=False):
    """Enable 100% reproducibility on operations related to tensors and randomness.

    Parameters:
        seed (int): seed value for global randomness
        fast_n_close (bool): whether to favour efficiency at the cost of determinism/reproducibility
    """
    set_seeds(seed=seed)
    if fast_n_close:
        return

    logging.warning("*******************************************************************************")
    logging.warning("*** set_global_determinism is called, setting full determinism, will be slow ***")
    logging.warning("*******************************************************************************")

    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    # https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
    from tfdeterminism import patch
    patch()
This can then be called at the very top of the ML algorithm/process code to make it 100% reproducible.
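For instance, a training entry point could call it once, before any dataset or model objects are created (a sketch; SEED and the elided training code are placeholders):

SEED = 42

def main():
    # Seeding and determinism must be configured before any TensorFlow op is created.
    set_global_determinism(seed=SEED, fast_n_close=False)
    # ... build the tf.data pipeline and the U-net model, then call model.fit() as usual ...

if __name__ == "__main__":
    main()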

Result

What happens if we don't write reproducible ML? What kind of difference are we really talking about? The last two columns of figure 9 show results obtained by models trained on exactly the same dataset, with exactly the same code, with EXACTLY one exception: the dropout layers used in the network were unseeded. All other measures discussed above were taken into account.

Figure 9: Effect of just forgetting to set one seed amidst many

Looking at the result for the first pet, which is a very simple case, we can see a subtle difference in the outcomes of the two models. The second pet is slightly more complicated due to shadow, and we can see obvious differences in the outcome. The third case is very hard for the pre-trained, frozen-backbone model we are using, and we can see major differences in the results between the two models.

If we were to apply all the measures discussed above, then 100% reproducible ML can be obtained. This is shown in the following two logs, both produced by running:

python train.py \
  --input /wks/model_dataset \
  --hyperparam_fn_path best_hyper_parameters.json \
  --output logs \
  --checkpoint_path "logs/ckpts" \
  --tensorboard_path logs
wherein the contents of best_hyper_parameters.json are:
{
   "batch_size":60,
   "epochs":12,
   "iterations":100,
   "learning_rate":0.0018464290366223407,
   "model_arch":"MobileNetV2",
   "steps_per_epoch":84
}
Run attempt 1:
WARNING:root:******* set_global_determinism is called, setting seeds and determinism *******
TensorFlow version 2.0.0 has been patched using tfdeterminism version 0.3.0
Input: tf-data, Model: MobileNetV2, Batch Size: 60, Epochs: 12, Learning Rate: 0.0018464290366223407, Steps Per Epoch: 84
Train for 84 steps, validate for 14 steps
Epoch 1/12
2019-11-07 12:48:27.576286: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
84/84 [==============================] - 505s 6s/step - loss: 0.8187 - iou_score: 0.4867 - f1_score: 0.6092 - binary_accuracy: 0.8749 - val_loss: 1.0283 - val_iou_score: 0.4910 - val_f1_score: 0.6297 - val_binary_accuracy: 0.8393
Epoch 2/12
84/84 [==============================] - 533s 6s/step - loss: 0.6116 - iou_score: 0.6118 - f1_score: 0.7309 - binary_accuracy: 0.9150 - val_loss: 0.6965 - val_iou_score: 0.5817 - val_f1_score: 0.7079 - val_binary_accuracy: 0.8940
Epoch 3/12
84/84 [==============================] - 527s 6s/step - loss: 0.5829 - iou_score: 0.6301 - f1_score: 0.7466 - binary_accuracy: 0.9197 - val_loss: 0.6354 - val_iou_score: 0.6107 - val_f1_score: 0.7312 - val_binary_accuracy: 0.9038
Epoch 4/12
84/84 [==============================] - 503s 6s/step - loss: 0.5733 - iou_score: 0.6376 - f1_score: 0.7528 - binary_accuracy: 0.9213 - val_loss: 0.6192 - val_iou_score: 0.6227 - val_f1_score: 0.7411 - val_binary_accuracy: 0.9066
Epoch 5/12
84/84 [==============================] - 484s 6s/step - loss: 0.5566 - iou_score: 0.6461 - f1_score: 0.7599 - binary_accuracy: 0.9241 - val_loss: 0.5827 - val_iou_score: 0.6381 - val_f1_score: 0.7534 - val_binary_accuracy: 0.9156
Epoch 6/12
84/84 [==============================] - 509s 6s/step - loss: 0.5524 - iou_score: 0.6497 - f1_score: 0.7629 - binary_accuracy: 0.9247 - val_loss: 0.5732 - val_iou_score: 0.6477 - val_f1_score: 0.7605 - val_binary_accuracy: 0.9191
Epoch 7/12
84/84 [==============================] - 526s 6s/step - loss: 0.5439 - iou_score: 0.6544 - f1_score: 0.7669 - binary_accuracy: 0.9262 - val_loss: 0.5774 - val_iou_score: 0.6456 - val_f1_score: 0.7590 - val_binary_accuracy: 0.9170
Epoch 8/12
84/84 [==============================] - 523s 6s/step - loss: 0.5339 - iou_score: 0.6597 - f1_score: 0.7710 - binary_accuracy: 0.9279 - val_loss: 0.5533 - val_iou_score: 0.6554 - val_f1_score: 0.7672 - val_binary_accuracy: 0.9216
Epoch 9/12
84/84 [==============================] - 518s 6s/step - loss: 0.5287 - iou_score: 0.6620 - f1_score: 0.7730 - binary_accuracy: 0.9288 - val_loss: 0.5919 - val_iou_score: 0.6444 - val_f1_score: 0.7584 - val_binary_accuracy: 0.9148
Epoch 10/12
84/84 [==============================] - 506s 6s/step - loss: 0.5259 - iou_score: 0.6649 - f1_score: 0.7753 - binary_accuracy: 0.9292 - val_loss: 0.5532 - val_iou_score: 0.6554 - val_f1_score: 0.7674 - val_binary_accuracy: 0.9218
Epoch 11/12
84/84 [==============================] - 521s 6s/step - loss: 0.5146 - iou_score: 0.6695 - f1_score: 0.7789 - binary_accuracy: 0.9313 - val_loss: 0.5586 - val_iou_score: 0.6581 - val_f1_score: 0.7689 - val_binary_accuracy: 0.9221
Epoch 12/12
84/84 [==============================] - 507s 6s/step - loss: 0.5114 - iou_score: 0.6730 - f1_score: 0.7818 - binary_accuracy: 0.9317 - val_loss: 0.5732 - val_iou_score: 0.6501 - val_f1_score: 0.7626 - val_binary_accuracy: 0.9179

Run attempt 2:

WARNING:root:******* set_global_determinism is called, setting seeds and determinism *******
TensorFlow version 2.0.0 has been patched using tfdeterminism version 0.3.0
Input: tf-data, Model: MobileNetV2, Batch Size: 60, Epochs: 12, Learning Rate: 0.0018464290366223407, Steps Per Epoch: 84
Train for 84 steps, validate for 14 steps

Epoch 1/12
2019-11-07 10:45:51.549715: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
84/84 [==============================] - 549s 7s/step - loss: 0.8187 - iou_score: 0.4867 - f1_score: 0.6092 - binary_accuracy: 0.8749 - val_loss: 1.0283 - val_iou_score: 0.4910 - val_f1_score: 0.6297 - val_binary_accuracy: 0.8393
Epoch 2/12
84/84 [==============================] - 515s 6s/step - loss: 0.6116 - iou_score: 0.6118 - f1_score: 0.7309 - binary_accuracy: 0.9150 - val_loss: 0.6965 - val_iou_score: 0.5817 - val_f1_score: 0.7079 - val_binary_accuracy: 0.8940
Epoch 3/12
84/84 [==============================] - 492s 6s/step - loss: 0.5829 - iou_score: 0.6301 - f1_score: 0.7466 - binary_accuracy: 0.9197 - val_loss: 0.6354 - val_iou_score: 0.6107 - val_f1_score: 0.7312 - val_binary_accuracy: 0.9038
Epoch 4/12
84/84 [==============================] - 515s 6s/step - loss: 0.5733 - iou_score: 0.6376 - f1_score: 0.7528 - binary_accuracy: 0.9213 - val_loss: 0.6192 - val_iou_score: 0.6227 - val_f1_score: 0.7411 - val_binary_accuracy: 0.9066
Epoch 5/12
84/84 [==============================] - 534s 6s/step - loss: 0.5566 - iou_score: 0.6461 - f1_score: 0.7599 - binary_accuracy: 0.9241 - val_loss: 0.5827 - val_iou_score: 0.6381 - val_f1_score: 0.7534 - val_binary_accuracy: 0.9156
Epoch 6/12
84/84 [==============================] - 494s 6s/step - loss: 0.5524 - iou_score: 0.6497 - f1_score: 0.7629 - binary_accuracy: 0.9247 - val_loss: 0.5732 - val_iou_score: 0.6477 - val_f1_score: 0.7605 - val_binary_accuracy: 0.9191
Epoch 7/12
84/84 [==============================] - 506s 6s/step - loss: 0.5439 - iou_score: 0.6544 - f1_score: 0.7669 - binary_accuracy: 0.9262 - val_loss: 0.5774 - val_iou_score: 0.6456 - val_f1_score: 0.7590 - val_binary_accuracy: 0.9170
Epoch 8/12
84/84 [==============================] - 514s 6s/step - loss: 0.5339 - iou_score: 0.6597 - f1_score: 0.7710 - binary_accuracy: 0.9279 - val_loss: 0.5533 - val_iou_score: 0.6554 - val_f1_score: 0.7672 - val_binary_accuracy: 0.9216
Epoch 9/12
84/84 [==============================] - 518s 6s/step - loss: 0.5287 - iou_score: 0.6620 - f1_score: 0.7730 - binary_accuracy: 0.9288 - val_loss: 0.5919 - val_iou_score: 0.6444 - val_f1_score: 0.7584 - val_binary_accuracy: 0.9148
Epoch 10/12
84/84 [==============================] - 531s 6s/step - loss: 0.5259 - iou_score: 0.6649 - f1_score: 0.7753 - binary_accuracy: 0.9292 - val_loss: 0.5532 - val_iou_score: 0.6554 - val_f1_score: 0.7674 - val_binary_accuracy: 0.9218
Epoch 11/12
84/84 [==============================] - 495s 6s/step - loss: 0.5146 - iou_score: 0.6695 - f1_score: 0.7789 - binary_accuracy: 0.9313 - val_loss: 0.5586 - val_iou_score: 0.6581 - val_f1_score: 0.7689 - val_binary_accuracy: 0.9221
Epoch 12/12
84/84 [==============================] - 483s 6s/step - loss: 0.5114 - iou_score: 0.6730 - f1_score: 0.7818 - binary_accuracy: 0.9317 - val_loss: 0.5732 - val_iou_score: 0.6501 - val_f1_score: 0.7626 - val_binary_accuracy: 0.9179

So we now have 100% reproducible ML code, but saying training is snail-ish is an understatement. Training time has increased (CPU-based measurements) from 28 minutes to 1 hour 45 minutes, as we give away inter-thread parallelism and also asynchronous computation optimizations. This is not practical in reality. This is also why reproducibility in ML is more focussed on having a road map to reach the same conclusions (Dodge). This is realized by maintaining a system capable of capturing full provenance over everything involved in the ML process, including data, code, processes, and infrastructure/environment. This will be the focus of part 3 of this blog series.


The next part of the technical blog series, [Reproducibility in Machine Learning], is End-to-end reproducible Machine Learning pipelines on Kubernetes.

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine-Learning.html

Reproducibility in Machine Learning - Research and Industry

This is Part 1, Reproducibility in Machine Learning - Research and Industry, of the technical blog series [Reproducibility in Machine Learning]. Part 2 and Part 3 can be found here and here respectively.


Machine learning (ML) is an interesting field aimed at solving problems that cannot be solved by applying deterministic logic. In fact, ML solves problems in logits [0, 1], with probabilities! ML is a highly iterative and fiddly field, with much of its intelligence derived from data upon the application of complex mathematics. Sometimes, even a slight change, such as changing the order of the input data, can change the outcome of an ML process drastically. xkcd puts it quite aptly:

Figure 1: Machine Learning explained by XKCD

This phenomenon is explained as Change Anything Changes Everything, a.k.a. the CAKE principle, coined by Sculley et al. in their NIPS 2015 paper titled ["Hidden Technical Debt in Machine Learning Systems"][scully_2015]. The CAKE principle highlights that in ML, no input is ever really independent.

What is reproducibility in ML

Reproducibility, as per the Oxford dictionary, is defined as something that can be produced again in the same way.

Figure 2: Reproducible defined

In an ML context, it relates to getting the same output from the same algorithm, (hyper)parameters, and data on every run.

To demonstrate, let's take a simple linear regression example (shown below) on the Scikit-learn Diabetes dataset. Linear regression is all about fitting a line, i.e. Y = a + bX, over data points represented as X, with b being the slope and a being the intercept.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()    
diabetes_X = diabetes.data[:, np.newaxis, 9]
xtrain, xtest, ytrain, ytest = train_test_split(diabetes_X, diabetes.target, test_size=0.33)
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)
diabetes_y_pred = regr.predict(xtest)

# The coefficients 
print(f'Coefficients: {regr.coef_[0]}\n'
      f'Mean squared error: {mean_squared_error(ytest, diabetes_y_pred):.2f}\n'
      f'Variance score: {r2_score(ytest, diabetes_y_pred):.2f}')
# Plot outputs 
plt.scatter(xtest, ytest,  color='green')
plt.plot(xtest, diabetes_y_pred, color='red', linewidth=3)
plt.ylabel('Quantitative measure of diabetes progression')
plt.xlabel('One of six blood serum measurements of patients')
plt.show()
A linear regression example on Scikit Diabetes Dataset

The above ML code is NOT reproducible. Every run will give different results: a) the data distribution will vary, and b) the obtained slope and intercept will vary. See Figure 3.

Figure 3: Repeated run of above linear regression code produces different results

In the above example, we are using the same dataset, the same algorithm, and the same hyper-parameters. So why are we getting different results? Here, the method train_test_split splits the diabetes dataset into training and test sets, but while doing so it performs a random shuffle of the dataset. The seed for this random shuffle is not set here. Because of this, every run produces a different training dataset distribution. Due to this, the slope and intercept of the regression line end up being different. In this simple example, if we were to set a random state for the method train_test_split, e.g. random_state=42, then we would have a reproducible regression example over the diabetes dataset. The reproducible version of the above regression example is as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()    
diabetes_X = diabetes.data[:, np.newaxis, 9]
xtrain, xtest, ytrain, ytest = train_test_split(diabetes_X, diabetes.target, test_size=0.33,
                                                random_state=42)
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)
diabetes_y_pred = regr.predict(xtest)

# The coefficients 
print(f'Coefficients: {regr.coef_[0]}\n'
      f'Mean squared error: {mean_squared_error(ytest, diabetes_y_pred):.2f}\n'
      f'Variance score: {r2_score(ytest, diabetes_y_pred):.2f}')
# Plot outputs 
plt.scatter(xtest, ytest,  color='green')
plt.plot(xtest, diabetes_y_pred, color='red', linewidth=3)
plt.ylabel('Quantitative measure of diabetes progression')
plt.xlabel('One of six blood serum measurements of patients')
plt.show()
A reproducible linear regression example on Scikit Diabetes Dataset

Seeding the random state is not the only challenge in writing reproducible ML. In fact, there are several reasons why reproducibility in ML is so hard to achieve. I will go into that a bit later, in the section Challenges in realizing reproducible ML. The first question should be: why does reproducibility matter in ML?

Importance of reproducibility in ML

Non-reproducible single occurrences are of no significance to science. - Popper (The Logic of Scientific Discovery)

The importance of reproducibility has been increasingly recognized since Nature's survey (2016) reported a reproducibility crisis. As per this survey report, 70% of researchers have failed to reproduce another scientist's experiments, and more than 50% have failed to reproduce their own experiments. With more than half of the participating scientists agreeing, the reproducibility crisis is indeed very real. Dr. Joelle Pineau, an Associate Professor at McGill University and lead of Facebook's Artificial Intelligence Research lab, covered the reproducibility crisis in her talk at the International Conference on Learning Representations (ICLR) 2018 (youtube). She is determined to nip this crisis in the bud in AI research (src). It's not just her; several AI research groups are coming up with measures to ensure reproducibility, for example:
- Model Cards at Google
- the Reproducibility Checklist at NeurIPS
- the ICLR Reproducibility Challenge at ICLR
- Show Your Work at the Allen Institute for Artificial Intelligence

Aside from results being of no use if they can't be reproduced, as Popper suggested in the above quote, why does reproducibility matter?

1. Understanding, Explaining, Debugging, and Reverse Engineering

Reproducibility helps with understanding, explaining, and debugging. Reproducibility is also a crucial means to reverse engineering.

Machine learning is inherently difficult to explain, understand, and also debug. Obtaining a different output on a subsequent run just makes understanding, explaining, and debugging all the more challenging. How do we ever reverse engineer? As it is, understanding and explaining are hard with machine learning; it's even harder with deep learning. For over a decade, researchers have been trying to understand what these deep networks learn and have not yet fully succeeded.

From visualizing higher-layer features of deep networks (2009), to activation atlases, i.e. what individual neurons in a deep network do (2017), to understanding how deep networks decide (2018): these are all ongoing, progressive efforts towards understanding. Meanwhile, explainability has morphed into a dedicated field, Explainable Artificial Intelligence (XAI).

2. Correctness

If anything can go wrong, it will -Murphy's law

Correctness is important, as Murphy's law rarely fails us. These are some examples of the great AI failures of our times.
Figure 4: Example of some of the great AI failures of our times

Google Photos launched an AI capability to automatically tag images; it was found to be tagging people of dark skin as gorillas. Amazon's recruiting software exhibited gender bias, and IBM's Watson gave unsafe recommendations for cancer treatment.

ML output should be correct in addition to being explainable. Reproducibility helps achieve correctness through understanding and debugging.

3. Credibility

ML output must be credible. This is not just from a fairness and ethics viewpoint, but also because ML outputs sometimes impact lives (e.g. mortgage approval). Also, end-users of ML output expect answers to be verifiable, reliable, unbiased, and ethical. As LeCun said in his keynote at the International Solid-State Circuits Conference (ISSCC) in San Francisco, 2019:

Good results are not enough, Making them easily reproducible also makes them credible. - Lecun, ISSCC 2019

4. Extensibility

Reproducibility in preceding layers is needed to build out and extend. Can we build a building-outline model if we can't repeatedly generate roof semantics, as shown in figure 5? What if we keep getting a different size for the same roof?

Figure 5: Extending ML

Extensibility is essential for utilizing ML outputs for consumption. As it is, raw outputs from ML are rarely usable by end-users. Most ML outputs need to be post-processed and augmented to be consumption-ready.

5. Data harvesting

The world’s most valuable resource is no longer oil, but data! - economist.com

To train a successful ML algorithm, a large dataset is usually needed; this is especially true for deep learning. Obtaining large volumes of training data, however, is not always easy, and it can be quite expensive. In some cases the occurrences of a scenario are so rare that obtaining a large dataset will either take forever or is simply not possible. For example, a dataset for Merkel-cell carcinoma, a very rare type of skin cancer, will be very challenging to procure.

For this reason, data harvesting, a.k.a. synthetic data generation, is considered. Tirthajyoti Sarkar, the author of Data Wrangling with Python: Creating actionable data from raw sources, wrote an excellent post on data harvesting using scikit-learn that covers this topic in detail. More recently, Generative Adversarial Networks (GANs), introduced by Ian Goodfellow, have been heavily used for this purpose. Synthetic Data for Deep Learning is an excellent review article that covers this topic in detail for deep learning.

Given that ML models (e.g. GANs) are now being used to generate training data, it's all the more important that reproducibility in such applications is ensured. Let's say we trained a near-perfect golden-goose model on data (including some synthetic data). But the storage caught proverbial fire, and we lost this golden-goose model along with the data. Now we have to regenerate the synthetic data and obtain the same model, but the synthetic data generation process is not quite reproducible. Thus, we lost the golden goose!

Challenges in realizing reproducible ML

Reproducible ML does not come easy. A wise man once said:

When you want something, all the universe conspires in helping you to achieve it. - The Alchemist by Paulo Coelho

But when it comes to reproducible ML, it's quite the contrary. Every single resource and technique needed to realize ML (hardware, software, algorithms, process & practice, data) poses some kind of challenge to reproducibility (see figure 6).

Figure 6: Overview of challenges in reproducible ML

1. Hardware

ML algorithms are quite compute hungry. The complex computation needed by ML operations nowadays runs in the order of giga/tera floating-point operations (GFLOPS/TFLOPS). This requires high parallelism and multiple central processing units (CPUs), if not specialized hardware such as (general-purpose) graphics processing units (GPU/GPGPU), tensor processing units (TPU), etc., to complete in a reasonable time frame.

But these efficiencies in floating-point computation, at both the CPU and GPU level, come at the cost of reproducibility.

  • CPU

Using intra-op (within an operation) and inter-op (amongst multiple operations) parallelism on CPU can sometimes give different results on each run. One such example is using OpenMP for (intra-op) parallelization. See the excellent talk titled "Consistency of Floating Point Results or Why doesn't my application always give the same answer" (Corden 2018) for more in-depth insight into this. Also see the wandering precision blog.

  • GPU

General-purpose GPUs can perform vector operations thanks to their streaming multiprocessor (SM) units. The asynchronous computation performed by these units may give different results on different runs. Floating-point multiply-add (FMAD) operations, or even reductions over floating-point operands, are such examples. Some algorithms, e.g. vector normalization, can also be non-reproducible due to reduction operations. See the reproducible summation paper for more info.

Changing the GPU architecture may lead to different results too. Differences in the SM units or architecture-specific optimizations are a couple of reasons why such differences may arise.

See Corden's Consistency of Floating-Point Results or Why doesn’t my application always give the same answer for more details.

2. Software

It's not just hardware. Some software offering high-level abstractions or APIs for performing intensive computation does not guarantee reproducibility in its routines. For instance, NVIDIA's popular CUDA-based deep learning library cuDNN does not guarantee reproducibility in some of its routines, e.g. cudnnConvolutionBackwardFilter (ref). Popular deep learning libraries such as tensorflow (ref 1, ref 2) and pytorch (ref) also do not guarantee 100% reproducibility.

There is an excellent talk by Duncan Riach, maintainer of tensorflow_determinism, on determinism in deep learning, presented at NVIDIA's GPU Technology Conference 2019.

Sometimes it's not just a trade-off for efficiency but a simple software bug that leads to non-reproducibility. One such example is the bug I ran into, which resulted in a different geo-location for the same computation when a certain library version was upgraded. This is a clear case of a software bug, but it underlines the fact that reproducibility goes beyond just computation and precision.

3. Algorithm

Several ML algorithms can be non-reproducible due to their expectation of randomness. A few examples are dropout layers and random initialization. Some algorithms can be non-deterministic due to the underlying computational complexity requiring non-reproducible measures, similar to the ones discussed in the software section; examples include vector normalization and the backward pass.

4. Process & Practice

ML loves randomness!

Figure 7: Randomness defined by xkcd

When things don't work with ML, we randomize (pun intended). We have randomness everywhere, from algorithms to processes and practices, for instance:
- Random initializations
- Random augmentations
- Random noise introduction (adversarial robustness)
- Data shuffles

To ensure that randomness is seeded and can be reproduced (much like the earlier scikit-learn linear regression example), a long seed-setting ritual needs to be performed in Python:

os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
tensorflow.random.set_seed(seed)
numpy.random.seed(seed)
tensorflow.keras.layers.Dropout(x, seed=SEED)
tensorflow.image.random_flip_left_right(x, seed=seed)
tensorflow.random_normal_initializer(x, y, seed=seed)
# many such algorithmic layers as above

Can we ever win with this seed setting?

Figure 8: Seed setting (image credit: google)

5. Data

No input is ever really independent. [Sculley et al. 2015][scully_2015]

Data is the main input to ML algorithms, and these algorithms are not just compute hungry but also data hungry. So we are really talking about big data. When data volume is large, we are dealing with all sorts of challenges:
- Data management
- Data provenance
- Data poisoning
- Under-represented data (inappropriate demographics)
- Over-represented data (inappropriate demographics)

One of the reasons why ML is so iterative is that we need to evolve the ML algorithm with the data whilst also continuously evolving the data itself (e.g. data massaging, feature engineering, data augmentation). That's why data provenance is important, but it's also important to maintain lineage from data provenance to the ML processes. In short, end-to-end provenance is needed across ML processes.

6. Concept drift

A model is rarely deployed twice. [Talby, 2018][Talby]

One of the reasons why a model rarely gets deployed more than once ([ref][Talby]) is concept drift. Our concept of things and stuff keeps evolving. Don't believe me? Figure 9 shows how our notion of a car has evolved over time; our current, evolving impression of the car is a solar-powered, self-driving car!

Figure 9: Our evolving concept of car

So now we don't just have to manage reproducibility for one model but for many! This is because our model needs to keep learning, known in ML as continual learning (more info). An interesting review paper on this topic is here.

Figure 10: Top features - Dresner Advisory Services Data Science and Machine Learning Market Study

In fact, continual learning is so well recognized that support for easy iteration and continuous improvement were the top two features that industry voted as their main focus for ML, as per Dresner Advisory Services' 6th annual 2019 Data Science and Machine Learning Market Study (see figure 10).


The next part of this technical blog series, [Reproducibility in Machine Learning], is Realizing reproducible Machine Learning - with Tensorflow.

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine-Learning.html

[scully_2015]: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

[Talby]: https://www.oreilly.com/radar/lessons-learned-turning-machine-learning-models-into-real-products-and-services/

Reproducibility in Machine Learning blog series

This technical blog series, titled "Reproducibility in Machine Learning", is divided into three parts:

1. Reproducibility in Machine Learning - Research and Industry
2. Realizing reproducible Machine Learning - with Tensorflow
3. End-to-end reproducible Machine Learning pipelines on Kubernetes

Some of the content of this blog series was covered at KubeCon US 2019, a Kubernetes conference. Details of this talk can be found here, with the recording available here.

Part 1: Reproducibility in Machine Learning - Research and Industry

In Part 1, the objective is to discuss the importance of reproducibility in machine learning. It also covers where research and industry stand in writing reproducible ML. This blog can be accessed here.

Part 2: Realizing reproducible Machine Learning - with Tensorflow

The focus of Part 2 is writing reproducible machine learning code. Tensorflow is used as the machine learning stack for demonstration purposes. This blog can be accessed here.

Part 3: End-to-end reproducible Machine Learning pipelines on Kubernetes

Part 3 is all about realizing end-to-end machine learning pipelines on Kubernetes. This blog can be accessed here.