
AI

Launch of Deep Learning at Scale - An O'Reilly Book

I am thrilled to announce the release of "Deep Learning at Scale: At the Intersection of Hardware, Software, and Data", an O'Reilly book! I have been working on this project for over two years.

"Deep Learning at Scale: At the Intersection of Hardware, Software, and Data" (O'Reilly) illustrates complex concepts of full-stack deep learning and reinforces them through hands-on exercises to equip you with tools and techniques to scale your project. Scaling efforts are only beneficial when they are effective and efficient. To that end, this guide explains the intricate concepts and techniques that will help you scale effectively and efficiently.

Order your copy today

To order your copy, use the following links based on your preferred format:

- Amazon
- Amazon AU


Alternatively, you can access the book using the 30-day trial link:

30-day trial access from O'Reilly Media

More info

For more information, see the project details.

ChatGPT vs Me: As a Children's Author

For all the right reasons, large language models have taken the world by storm! It's pretty impressive what ChatGPT can do - unprecedented coherence, relevance, and tone of delivery for synthetic text generation. Well done, OpenAI!

I have to admit I have strongly resisted the urge to use it for purposes beyond joy and exploration. It's been very hard to abstain, especially because I am extremely passionate about AI and I also work in AI. I truly, madly, deeply, utterly fear that we, as a community of carers of young people, are simply NOT ready for it! It's the infancy of the due diligence to understand the societal implications and governance around it that scares me the most. I fear that while we, the grown-ups, are probably ready and raring to go at it, it's our younger generations who are not at all ready for it. Once this genie is out of the bottle, out is just the way! And it would be a repeat of our social media fiasco all over again. Perhaps I am coming from a dark place of seeing how easily our kids are getting bullied and exploited on social media, with authorities still scrambling, even after the fact, to contain it!

On a lighter note, selfishly speaking, I also fear depleting my creative and critical thinking abilities in favour of convenience and productivity. If Darwin's theory is anything to go by, that's one human talent we should preserve and reinforce, I think. Did I hear you say, "But you can still be critical of the presented text"? Sure, but we are also a large language model in many ways :) - we get biased and we are lazy. Isn't laziness the inspiration for many useful inventions of our times?

As a mother of young kids and as someone who believes in lifelong learning, I made a tiny effort last year to write chapter books for early readers covering the discoveries of the great scientists of our past and present. In my debut release, I featured Sir Isaac Newton's discoveries in a book titled "Curious Cassie's beach ride quest: A Celebration of Isaac Newton's Discoveries". I tested the book on a sample of five kids, all aged 6-7 years, and observed extreme enthusiasm and an increased number of questions thereafter. Mission accomplished, eh?

I failed this morning! I failed to resist the ChatGPT urge to see what it would write instead. So here's Me vs ChatGPT, a debut author vs a stochastic parrot:

ChatGPT vs Me: Show me the script

ChatGPT

Prompt

Write a chapter book for 6 year old kids

ChatGPT's Response
Chapter 1: The Magical Forest

It was a beautiful day in the magical forest. The sun was shining and the birds were singing. The trees were tall and green, and the flowers were blooming.

In the center of the forest was a small cottage, where a young girl named Lily lived with her parents. Lily was six years old, and she loved to explore the forest.

One day, while out on a walk, Lily stumbled upon a strange and magical place. It was a clearing filled with glittering diamonds and sparkling jewels.

"Wow!" exclaimed Lily. "I've never seen anything like this before!"
As she reached out to touch one of the diamonds, a bright light shone and a magical fairy appeared before her.

"Hello, Lily," said the fairy. "I am the guardian of the forest, and I have been waiting for you. You have a very special task to complete."

"Me?" asked Lily. "What can I do?"

"There is a wicked witch who has been causing trouble in the forest," explained the fairy. "She has been turning all of the animals into stone, and we need your help to stop her."

"I'll do it!" said Lily, determined to help the animals.

Chapter 2: The Wicked Witch

Lily set out on her journey to find the wicked witch. She followed the fairy's instructions and travelled deep into the forest.
As she walked, she saw that the trees were no longer green and the flowers had withered away. It was clear that the witch had been causing destruction wherever she went.

Finally, Lily came to a dark and gloomy castle. She knew this was where the witch lived.

She tiptoed through the castle, trying not to make a sound. She could hear the witch cackling in the next room.

Lily peeked through the door and saw the witch stirring a bubbling pot. She had a long, crooked nose and sharp, pointed teeth.

"Ah-ha!" shouted the witch, turning to face Lily. "I've been expecting you, little girl. You won't be able to stop me from turning all of the animals into stone!"

But Lily was not afraid. She remembered the fairy's words and knew she had to be brave.

"I won't let you harm any more animals!" said Lily, standing up to the witch.

The witch cackled and raised her wand, but Lily was quick. She grabbed a nearby broom and swung it at the witch, knocking the witches wand out of her hand.

The witch let out a cry of rage and fled the castle, leaving Lily to save the animals.

Chapter 3: The Great Celebration

With the witch gone, Lily was able to use her magic to turn the animals back to normal. The forest was filled with joy and celebration.

The fairy thanked Lily for her bravery and offered to grant her one wish.

"I wish for the forest to always be a happy and safe place," said Lily.

The fairy nodded and waved her wand, granting Lily's wish. From then on, the magical forest was filled with happiness and peace.
Lily returned home to her parents, who were overjoyed to see her safe and sound. They threw a great celebration in her honour, and the whole forest came to join in the fun.

Lily knew that she would always treasure her adventure in the magical forest, and she hoped to have many more adventures in the future.
Verdict

Very impressive choice of theme, tone, and word selection. It did not give me a title, so I asked it to name the title, and it responded with "The Brave Explorer: A Magical Adventure in the Forest." Very cool!

Prompt

Write a chapter book for 6 year old kids on STEM

ChatGPT's Response
Chapter 1: The Science Fair

It was a beautiful day in the town of Maplewood, and Lily was excited for the annual science fair at her school. She had been working on her project for weeks, and she couldn't wait to show it off.

Lily's project was a model of the solar system, complete with miniature planets made out of clay and a sun made out of papier-mâché. She had worked hard to make sure everything was accurate and to explain the orbits of the planets in a way that other kids could understand.

As the science fair began, Lily set up her project at her table and waited for the judges to come by. She was a little nervous, but she knew she had done her best.

The judges were impressed by Lily's project and asked her lots of questions about the solar system. Lily was happy to answer and explain what she had learned.

In the end, Lily's project won first place in the science fair! She was so proud of herself and couldn't wait to show her parents.

Chapter 2: The Technology Club

After the science fair, Lily's teacher, Mrs. Kim, asked her if she would like to join the Technology Club at school. Lily was excited at the chance to learn more about science and technology, and she eagerly accepted.

In the Technology Club, Lily and her friends learned about all sorts of different topics, from coding and robotics to engineering and renewable energy. They built and programmed robots, designed and tested bridges, and learned about how solar panels work.

Lily loved the Technology Club and couldn't wait to go each week. She learned so much and had a lot of fun doing it.

Chapter 3: The Engineering Competition

One day, Mrs. Kim announced that the Technology Club was going to participate in an engineering competition. The students would have to design and build a Rube Goldberg machine, a complicated contraption that could perform a simple task in a creative and convoluted way.

Lily and her friends were up for the challenge and set to work planning and building their machine. They worked together as a team, using their knowledge of physics, engineering, and problem-solving to make their machine work.

Finally, the day of the competition arrived. Lily and her friends were nervous but excited to see how their machine would perform.

As they presented their machine to the judges, it worked perfectly, performing the simple task with a series of intricate steps. Lily and her friends were thrilled when they were announced as the winners of the competition!

Chapter 4: The Future Scientists

After the engineering competition, Lily and her friends in the Technology Club knew that they wanted to continue exploring science and technology in the future. They were already thinking about new projects and competitions they could participate in.

Lily knew that she wanted to be a scientist when she grew up, and she was grateful for all of the opportunities she had had to learn and explore in the Technology Club.

She knew that no matter what she did in the future, she would always have a passion for science and technology and the desire to learn more.
Verdict

Nice plot setting, nice choice of theme and context. Nice job. Still, I find it very peripheral, explaining a lot about the setting but nothing about science. Maybe it's my prompt. Let's give it another go.

Prompt

Write a 6-7 year old kids’ book on Sir Isaac Newton's discoveries with Cassie as a protagonist

ChatGPT's Response
Cassie and the Apple Tree

Cassie was a curious little girl who loved to learn about the world around her. One day, she was walking through her backyard when she saw an apple fall from a tree.

"Hmm, I wonder why that happened?" Cassie thought to herself. She climbed up the tree to see if she could find out more.

As she reached the top, she saw a man standing under the tree. It was Sir Isaac Newton! Cassie had heard of him before. He was a very smart man who had made many important discoveries.
"Hello, Cassie," Sir Isaac said. "I see you're curious about the apple."

"Yes, I am," Cassie replied. "Why did it fall to the ground?"
"Ah, that's a good question," Sir Isaac said. "I figured out that there is a force called gravity that pulls objects towards the ground. That's why the apple fell."

Cassie was amazed. "Wow, that's so cool!" she exclaimed.
Sir Isaac smiled. "I also made many other discoveries," he told Cassie. "Like the laws of motion, which explain how objects move and change speed. And I invented calculus, which helps us understand how things change over time."

Cassie's head was spinning with all this new information. She couldn't wait to learn more about Sir Isaac's discoveries.

"Thank you for telling me all about this, Sir Isaac," Cassie said. "I can't wait to learn more!"

"You're welcome, Cassie," Sir Isaac replied. "I'm glad you're interested in learning. Remember, the more you learn, the more you'll discover about the world around you."

Cassie smiled and climbed back down the tree, excited to keep learning and discovering new things.
Verdict

Nice choice of words, the context is spot on, and the tone of delivery is nice too. Some of the phrases are spot on: '"Wow, that's so cool!" she exclaimed.' is something I have used too.

Apple and Newton are joined together forever. Cliché, my dear parrot! Cliché! Well done, still. Maybe I'd like more challenges given to the kids, but I still feel it's too shallow and lacks detail. Let's give it more work and see what ChatGPT does.

Prompt

Write a 6-7 year old kids chapter book on Sir Isaac Newton's discoveries with Cassie as the protagonist with illustrations

ChatGPT's Response
Chapter 1: Cassie and the Apple Tree

[Illustration: Cassie is standing in her backyard, looking up at an apple falling from a tree.]

Cassie was a curious little girl who loved to learn about the world around her. One day, she was walking through her backyard when she saw an apple fall from a tree.

"Hmm, I wonder why that happened?" Cassie thought to herself. She climbed up the tree to see if she could find out more.

[Illustration: Cassie is climbing up the tree, with a determined look on her face.]

As she reached the top, she saw a man standing under the tree. It was Sir Isaac Newton! Cassie had heard of him before. He was a very smart man who had made many important discoveries.

"Hello, Cassie," Sir Isaac said. "I see you're curious about the apple."

"Yes, I am," Cassie replied. "Why did it fall to the ground?"

"Ah, that's a good question," Sir Isaac said. "I figured out that there is a force called gravity that pulls objects towards the ground. That's why the apple fell."

[Illustration: Sir Isaac is pointing to a diagram of the earth and the apple, with an arrow labeled "gravity" pointing from the earth to the apple.]

Cassie was amazed. "Wow, that's so cool!" she exclaimed.

Chapter 2: The Laws of Motion

After learning about gravity, Cassie couldn't wait to find out more about Sir Isaac's discoveries. She asked him to tell her more.

"I also made many other discoveries," Sir Isaac said. "Like the laws of motion, which explain how objects move and change speed."

[Illustration: Sir Isaac is holding a book labeled "Laws of Motion," with diagrams of moving objects in the background.]

Cassie was fascinated. Sir Isaac explained that there are three laws of motion:

    Objects at rest stay at rest, and objects in motion stay in motion with the same speed and direction, unless acted upon by a force.
    The force needed to change the motion of an object is equal to the mass of the object times its acceleration.
    For every action, there is an equal and opposite reaction.

[Illustration: Sir Isaac is pointing to a chalkboard with the three laws of motion written on it, with illustrations of objects in motion on either side.]

"These laws help us understand how objects move and behave in the world around us," Sir Isaac said.

Chapter 3: Calculus

Cassie couldn't believe how much she was learning from Sir Isaac. She asked him if there was anything else he had discovered.

"Yes, I also invented calculus," Sir Isaac said. "It helps us understand how things change over time."

[Illustration: Sir Isaac is holding a book labeled "Calculus," with a diagram of a curve on the cover.]

Cassie was a little confused. "What do you mean?" she asked.

Sir Isaac explained that calculus is a way of studying how things change. "For example, we can use calculus to understand how fast a ball is rolling down a hill, or how fast a plant is growing," he said.
[Illustration: Sir Isaac is pointing to a graph with a curve on it, with a ball rolling down a hill in the background.]
Cassie was amazed. "That's so cool!" she exclaimed.
Chapter 4: The End
After learning all about Sir Isaac's discoveries, Cassie was even more excited to keep learning and discovering new things. She thanked Sir Isaac for all of his help and said goodbye.
"Remember, Cassie," Sir Isaac said as she walked away. "The more you learn, the more you'll discover about the world around you."
[Illustration: Cassie is waving goodbye to Sir Isaac, with a big smile on her face.]
Cassie smiled and waved, excited to keep exploring and learning. She knew that there were so many more mysteries to solve and discoveries to be made.
Verdict

I love this line: "The more you learn, the more you'll discover about the world around you." ChatGPT, you can be nice if not exploited! I liked the mention of calculus. It's a brilliant thought, but I stopped at algebra in my book!

However, I remain unsatisfied given the lack of detail and meat around the topic. I feel that, for now, human writing is more creative, rich, and informative! I am glad that's the way it is!

Me! Me!

Title: Curious Cassie's beach ride quest: A Celebration of Isaac Newton's Discoveries

[Illustration: Cassie is riding in a car with her family! She is buzzing with interesting curious keywords like Wow, What, How, When, Where! Sir Isaac is featured in the book.]

Here's a sample of my book. It shows the first chapter of the three-chapter book.

I am just a debut author, writing as a hobby. I am profoundly impressed by authors who have written brilliant books to teach, guide, and shape our children. I hope we will continue to create far richer content to shape ourselves and our future, with or without AI - just as long as we are in control.

Thanks a lot for reading! If you ever see any value in spreading the word about my book or procuring it for your loved ones, then please do so; I would be thrilled!

Disclaimer: This is a personal post. Views shared are my own and do not represent my employers.

Pydra - Pydantic and Hydra for configuration management of model training experiments

How do you manage the configurations of your model training experiments? Having good configuration management in place improves the user experience and simplifies experiment management. Some of the indirect advantages of a good config solution are clean, reliable, and simple application code. This is true for any software application; however, some applications demand a higher investment in the config piece than others. One key differentiator is how many config fields one has and how each of these relates to the others. This can get real messy real quick with deep learning experiments, where the list of configs and hyperparameters can rapidly grow out of hand as soon as one breaks out of ad hoc experimentation.

Context

By definition, model training configurations are hierarchical. For example, an experiment configuration can be broken down into data and model specifications. While these are loosely coupled specifications, they may enforce some degree of restriction on each other, e.g. the model's output specification needs to be in line with the data specification: a classifier network needs to output a 10-element vector if the data source is MNIST and a 20-element vector if it's PASCAL VOC.

Overall, a configuration solution for training experiments has the following requirements:

  1. Ability to easily templatize configurations of various components of your training experiments.
  2. Ability to organize the configuration of training in a hierarchical fashion acknowledging the dependency and hierarchy of participating components.
  3. Without duplication, reuse configurations across modules wherever needed, by means such as interpolation and expressions.
  4. Ability to compose config using templates and applying customization through overrides, environments etc.
  5. Being able to store and reuse the final assembled/compiled config and reproduce the experiments reliably.
  6. Support structured configs consisting of potentially complex, canonical Python types, with strong validation support beyond type hints. This is perhaps optional if one is not interested in dynamic domain-specific validations - something that cannot be achieved in serialized formats like yaml/json.
  7. Ability to version config and enforce a schema on the config specification for reliability and robustness.
  8. Ability to integrate easily into complex distributed launcher frameworks for both local and remote launch.
  9. A configuration management solution that integrates nicely into your ecosystem of tools and is not overly prescriptive.

With that in mind, I evaluated a bunch of tools, and the combination of Hydra and Pydantic stood out as a good choice. In this post, I will go over my learnings of both tools and also talk about the gaps I still see. Lastly, I'll put forth a tentative solution to bridge those gaps. The structure of this post is as follows:

  1. Hydra - its strengths and also what I think are its weaknesses.
  2. Pydantic - how it can be used to fill the gaps in Hydra.
  3. The missing bits of config management in Pydra, i.e. both Hydra & Pydantic.
  4. The solution to bridging the gap.
  5. Conclusion.

Note: The samples and code shared in this post are available here.

Hydra

Hydra is a dynamic, hierarchical config management tool with the ability to override parts of the config and to compose the config from various sources (e.g. command-line args, environment variables). It offers a nice solution for templated configurations that can be repurposed dynamically based on your objectives. Beyond rewriting rules, compositions, and interpolations to repurpose config templates, it also provides a mechanism to represent the config in plain old Python object style and can take care of serialization and deserialization itself. On the noteworthy features side, with Hydra you can fire a multi-process workload on multiple instances of configurations derived from the same template source. This can be a really handy feature for many use cases, including side-by-side comparisons.

It is also extensible, in that one can write their own plugins to launch local or remote workloads. Some other nice noteworthy features are runner config and callbacks. All in all, powered by OmegaConf, this is a great tool for config management.

A simple non-structured config example

Let's look at a simple use of Hydra for yaml-based config without involving a Python object model for the config. This example is borrowed and extended from a Hydra sample. It demonstrates three config modules, i.e. db, schema, and ui, assembled together via a high-level config.yaml.

Example hierarchical hydra config.

import logging
from dataclasses import dataclass
from typing import List, Optional

import hydra
from hydra.core.config_store import ConfigStore
from hydra.core.hydra_config import HydraConfig
from omegaconf import MISSING, DictConfig, OmegaConf

log = logging.getLogger(__name__)


@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    log.info("Type of cfg is: %s", type(cfg))
    cfg_dict = OmegaConf.to_container(cfg, throw_on_missing=False, resolve=True)

    log.info("Merged Yaml:\n%s", OmegaConf.to_yaml(cfg))

    log.info(cfg.db)
    log.info(cfg.db.driver)

    log.info(HydraConfig.get().job.name)


if __name__ == "__main__":
    my_app()
Sample code can be found here

A few things are happening in this script:

1. How is the config loaded? The script tells Hydra to look for hierarchical config under conf as the base config directory, with config.yaml as the top-level config specification. These defaults come from the decorator @hydra.main(config_path="conf", config_name="config"). A user can override them and point the config location to another place by using --config-dir for the base directory and --config-name for the top-level config, as follows:

python my_app_non_structured.py --config-dir conf_custom_hydra
python my_app_non_structured.py --config-name config_hydra
2. In what format can I access the config? Once Hydra has loaded the config, it is surfaced as a DictConfig from OmegaConf. The benefit of DictConfig is that you can access fields with attribute accessors, e.g. cfg.db.driver as shown in the code above, instead of the dict-style string-key access pattern.

3. Where does composition come from? What's defined in the conf directory is a template config. Some values may be missing, which is represented in yaml as ???. It's also possible to override values from the template config at runtime, for example: python my_app_non_structured.py db.user=suneeta. Here, the final configuration that the application sees is:
db:
  driver: mysql
  user: suneeta
  pwd: secret
ui:
  create_db: true
  view: true
schema:
  database: school
  tables:
  - name: students
  - name: exams
4. What is the final config used in my run? Hydra saves the specifics of the config used in a run in its output location. By default this is outputs/YYYY-MM-DD/HH-MM-SS/.hydra/config.yaml, which can be used to rerun the same run for reproducibility:
python my_app_non_structured.py --config-dir=outputs/2022-03-16/13-45-29/.hydra
Hydra also allows you to configure job and run settings; an example of such configuration is located here. Alternatively, one can override these settings in the main config.yaml file:

defaults:
  - _self_  # Note: to source default of hydra config 
  - db: mysql
  - ui: full
  - schema: school

hydra:
  job:
    env_set: 
      HDF5_USE_FILE_LOCKING: "false"
      ## Useful for distributed train setup
      # RANK: ${hydra.job.num}
    env_copy: ## Local environment copy
      - API_KEY  

  job_logging:
    version: 1
    formatters:
      simple:
        format: "[%(levelname)s] - %(message)s"
    handlers:
      console:
        class: logging.StreamHandler
        formatter: simple
        stream: ext://sys.stdout
    root:
      handlers: [console]

    disable_existing_loggers: false

  run:
    dir: outputs/${hydra.job.name}/${now:%Y-%m-%d_%H-%M-%S}
5. Multi-run and templates: multi-run allows users to run a multi-process workload on the same config template while dynamically changing some of its values. Effectively, the user runs multiple processes of the same code under different configurations. Parameter sweeps are a really good use case for this. In the following run, --multirun triggers three different processes, one each for the school, support, and warehouse schemas:
python my_app_non_structured.py db.user=suneeta schema=school,support,warehouse  --config-dir conf_custom_hydra --multirun

Each process gets an integer id assigned, which the application can recognize. The index starts from 0, as shown in the figure below.

Example of multi-runs.

This is especially useful in distributed training runs, where things like the RANK of the process need to be known. This can be easily configured using Hydra as follows:

hydra:
  job:
    env_set: 
      RANK: ${hydra.job.num}
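
On the application side, each spawned process can then read its rank from the environment; a minimal sketch (the variable handling here is illustrative):

import os

# RANK is populated by hydra.job.env_set as configured above; default to 0 if unset.
rank = int(os.environ.get("RANK", "0"))
print(f"Running as rank {rank}")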

An extension using structured config

If your application configs are slightly more complex and you want an object representation of your config, then Hydra provides a structured config option with ConfigStore. An example of this is here.

In this case, you define your config object model as follows:

@dataclass
class DBConfig:
    user: str = MISSING
    pwd: str = MISSING
    driver: str = MISSING
    timeout: int = 100


@dataclass
class MySQLConfig(DBConfig):
    driver: str = "mysql"


@dataclass
class PostGreSQLConfig(DBConfig):
    driver: str = "postgresql"


@dataclass
class View:
    create_db: bool = False
    view: bool = MISSING


@dataclass
class Table:
    name: str = MISSING


@dataclass
class Schema:
    database: str = "school"
    tables: List[Table] = MISSING


@dataclass
class Config:
    # We can now annotate db as DBConfig which
    # improves both static and dynamic type safety.
    db: DBConfig
    ui: View
    schema: Schema
A few things are noteworthy here:

1. MISSING is the structured-config equivalent of the missing value denoted in yaml as ???.
2. Hydra can only work with plain old Python objects, at best with dataclasses or drop-in replacements of dataclasses.
3. Structured config works only with Python primitives and can't serde (serialize and deserialize) canonical or complex Python objects - not even things like pathlib's Path.

Once the config object model is done, one should define the ConfigStore and register the config objects:

cs = ConfigStore.instance()
cs.store(name="config", node=Config)
cs.store(group="db", name="mysql", node=MySQLConfig)
cs.store(group="db", name="postgresql", node=PostGreSQLConfig)
cs.store(name="ui", node=View)
cs.store(name="schema", node=Schema)

Here, the group key in the config registry indicates mutually exclusive groups. More about these can be found here.

How does the Hydra injection change to support the config store? Not by a lot: as shown in this example, changing DictConfig to your high-level config object (Config in this case) does the trick. Then, OmegaConf.to_object(cfg) unpacks the dict config into your Python object model.

@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: Config) -> None:    
    log.info("Type of cfg is: %s", type(cfg))
    cfg_dict = OmegaConf.to_container(cfg, throw_on_missing=False, resolve=True)

    cfg_obj = OmegaConf.to_object(cfg)

This is all the application needs to do to move to structured config, and the features discussed earlier can be explored using the following commands:

python my_app.py
python my_app.py db.user=suneeta    
python my_app.py db.user=suneeta --config-dir conf_custom_hydra
# python my_app.py db.user=suneeta --config-name config_hydra
# Multi-run
python my_app.py db.user=suneeta schema=school,support,warehouse  --config-dir conf_custom_hydra --multirun
Code located here

There is another approach Hydra offers to instantiate objects through the config. This approach requires the user to know beforehand which configs are object-able, as unpacking them into objects requires explicit calls to the instantiate API. An example of such use is as follows; notice the fully qualified class name in the _target_ field of the config:

Example of instantiating objects.

This feature, in my view, should be used sparingly as it tightly couples the config to application-specific objects. This kind of mixed-breed config handling approach is very hard to maintain and understand.
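
For completeness, here is a minimal sketch of how _target_-based instantiation behaves; the datetime.timedelta target is purely illustrative and not from the sample code:

from hydra.utils import instantiate
from omegaconf import OmegaConf

# _target_ holds a fully qualified name; the remaining keys become constructor kwargs.
cfg = OmegaConf.create({"_target_": "datetime.timedelta", "days": 2})

obj = instantiate(cfg)  # imports datetime.timedelta and calls it with days=2
print(obj)              # 2 days, 0:00:00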

Some other interesting features of hydra

Interpolation

Interpolation offers a means to refer to an existing config value instead of duplicating it. It also allows users to derive values through string interpolation in the config. An example of interpolation is shown below:

Example of interpolation.
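
To make this concrete in code, here is a minimal OmegaConf sketch (the keys and values are illustrative):

from omegaconf import OmegaConf

# ui.title reuses db.user and db.driver via interpolation instead of duplicating them.
cfg = OmegaConf.create(
    {
        "db": {"user": "suneeta", "driver": "mysql"},
        "ui": {"title": "connected as ${db.user} via ${db.driver}"},
    }
)
print(cfg.ui.title)  # connected as suneeta via mysql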

Hydra Directives

Hydra allows overriding package configuration by employing directives; the documentation for these is here. In short, they offer a means to rewrite and relocate config dynamically. I think this is evil and facilitates yaml engineering. It can also be a debugging nightmare if your intended use of config is beyond standard interfaces.

Example of directives to rewrite config specifications.

Talking about directives, _global_ ones are quite handy in managing configuration overrides. One example is shown here in the Hydra samples themselves.

Where does Hydra fall short?

Overall, OmegaConf and Hydra are two powerful tools for config management; however, I personally find the following shortcomings in these tools:

  1. No support to serde (serialize and deserialize) canonical or complex Python objects, which can be really annoying and causes more overhead for applications. Supporting only primitives is a good start, but not enough.

    1.1. Neither OmegaConf nor Hydra supports the Union type in the config.

  2. Support for structured config types is limited to plain old Python objects or dataclasses (and drop-in replacements thereof).

  3. A validation framework is a must-have in any config management framework. Anything users provide must be validated in the application for various things like type checking, application-specific validation, constraint limits, etc. This is currently offloaded largely to the application. The only choice the application is left with is using structured config as simple objects and performing all validation explicitly, including obvious things like type checking. This is limiting, in my view.

  4. Interpolation is a great feature to have. An extension to this would be expression-based interpolation, e.g. deriving the total number of items from a list of items. However, this support is lacking in Hydra, with issues having been raised.

Pydantic

Pydantic offers runtime type enforcement and validation and is itself a data validation and settings management tool. It works with pure, canonical Python data models and provides two approaches for this:

  1. A drop-in replacement of dataclass
  2. The BaseModel abstraction

Through its validation framework and type checking, Pydantic offers schema enforcement and can serialize and deserialize objects from various formats, including but not limited to yaml/json, XML, ORM, etc. Its field type support, especially constrained types, is very rich and vastly simplifies application config validation. For example, can the image size be less than 0? What's the best approach to validate this: limit it at the config level, or sprinkle checks throughout the application wherever the config is used?
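
To make the constrained-type point concrete, here is a minimal sketch (ImageConfig is a hypothetical example, not from the post's repository):

from pydantic import BaseModel, ValidationError, conint


class ImageConfig(BaseModel):
    # conint(gt=0) rejects non-positive sizes at config-parse time,
    # so the application never has to re-check them.
    width: conint(gt=0)
    height: conint(gt=0)


try:
    ImageConfig(width=-224, height=224)
except ValidationError as e:
    print(e)  # reports that width must be greater than 0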

Overall, Pydantic is a well-documented library. An example of its dataclass use is shown below:

import logging
import sys
from pathlib import Path
from typing import Optional, Union

from pydantic import BaseModel, conint, validator
from pydantic.dataclasses import dataclass


log = logging.getLogger(__name__)


@dataclass
class Dataset:
    name: str
    path: str

    @validator("path")
    def validate_path(cls, path: str) -> Path:
        if Path(path).exists():
            print("exist")
        return Path(path)


@dataclass
class Model:
    type: str
    num_layers: conint(ge=3)
    width: Optional[conint(ge=0)]

    @validator("type")
    def validate_supported_type(cls, type: str) -> str:
        if type not in ["alex", "le"]:
            raise ValueError(f"'type' canonly be [alex, le] got: {type}")
        return type


@dataclass
class Config:
    dataset: Dataset
    model: Model
Notice the use of the validator decorator here. This is an approach to validate application-specific logic.

BaseModel comes with more bells and whistles, as it provides the syntactic sugar of the Config inner class, which can be used to configure the behavior of a concrete implementation of BaseModel. Details of this can be found here. Some of the settings, like anystr_lower and use_enum_values, are so handy. Who does not want to work with case-insensitive configurations and keep the application clear of case conversions? Does BCE or bce mean anything different from binary cross-entropy? A BaseModel implementation looks like this:

class Dataset(BaseModel):
    name: str
    path: str

    @validator("path")
    def validate_path(cls, path: str) -> Path:
        if Path(path).exists():
            print("exist")
        return Path(path)

    class Config:
        title = "Dataset"
        max_anystr_length = 10
        allow_mutation = False
        validate_assignment = True
        anystr_lower = True
        validate_all = True
        use_enum_values = True 
Full example code is located here
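
A quick usage sketch of the settings above, reusing the Dataset class just defined (and assuming /tmp exists on your machine): anystr_lower normalizes the case so the application never has to.

ds = Dataset(name="MNIST", path="/tmp")
print(ds.name)  # "mnist" - lower-cased by anystr_lower
print(ds.path)  # a pathlib Path - coerced by the path validator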

BaseSettings is another great feature that seeks to abstract away settings/config sourced from other places like environment variables, init files, etc.

class Secrets(BaseSettings):
    api_key: str = Field("", env="api_key")
    token: str

    class Config:
        # case_sensitive = True
        env_nested_delimiter = "_"

        @classmethod
        def customise_sources(
            cls,
            init_settings: SettingsSourceCallable,
            env_settings: SettingsSourceCallable,
            file_secret_settings: SettingsSourceCallable,
        ) -> Tuple[SettingsSourceCallable, ...]:
            return env_settings, init_settings, file_secret_settings
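
A usage sketch with hypothetical values: because customise_sources puts env_settings first, the environment takes precedence over init arguments.

import os

os.environ["api_key"] = "from-env"

secrets = Secrets(token="abc123", api_key="from-init")
print(secrets.api_key)  # "from-env" - the environment wins
print(secrets.token)    # "abc123"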

A few interesting notes

  1. It's interesting to note that Pydantic's validator can be repurposed to implement derived fields. An example of this is captured here. The use case: a list of classes is already provided in the config of another module, say data, and you want to "infer" the total number of classes from it in the model config. You can interpolate and copy over the list of classes from the data module to the model, but your requirement of only caring about the total number of classes is unfulfilled, as you can't do expression-based interpolation just yet. In this case, total_classes can be defined as either an int or a list of ints, and the total is deduced upon validation, leaving the config object always representing the total as an int, with the logic-based deduction happening at validation time. An example of this is as follows:

from typing import List, Union
from pydantic import BaseModel, conint, validator

class Model(BaseModel):
    total_classes: Union[conint(ge=0), List[conint(ge=0)]]

    @validator("total_classes")
    def validate_classes(cls, total_classes: Union[conint(ge=0), List[conint(ge=0)]]) -> conint(ge=0):
        if isinstance(total_classes, List):
            return len(total_classes)
        return total_classes
I have not seen any issue with using a validator like this, but I am unsure if this is the intended use of validator support. I am also unsure why the validator returns the way it does in this interface. (A short usage sketch follows after this list.)

  2. Hydra's benchmarking is pretty favorable (see the image below). However, they don't benchmark against OmegaConf and some other related tools; see details here.

Hydra's benchmarks.
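
Coming back to note 1, here is a short usage sketch of the derived-field validator, reusing the Model class defined above:

# Passing the full class list: the validator collapses it to a count.
print(Model(total_classes=[3, 7, 9]).total_classes)  # 3

# Passing a count directly still works.
print(Model(total_classes=12).total_classes)  # 12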

Can Pydantic bridge the gaps in Hydra?

Pydantic offers a strong validation framework, something that's been missing in Hydra. While Pydantic's drop-in replacement dataclass works with Hydra, one still misses out on the syntactic sugar of the Model Config, and remains limited to primitives, without Union types or Pydantic's more interesting constrained types.

It's fair to say that direct integration of Hydra with Pydantic is absent. The follow-up question should be: can we make it work? Sure we can :).

An example demonstrating how Hydra and Pydantic can be used in conjunction is shown here. The secret sauce of integration lies in the following:

  1. Giving up on the use of ConfigStore of hydra
  2. Reading the config as DictConfig and unpacking to Pydantic BaseModel object via dict

In short, it is a two-liner:

    OmegaConf.resolve(cfg)
    r_model = Config(**cfg)
where Config is a concrete implementation of Pydantic's BaseModel. The decision we make here is that configs are specified in yaml format, and we leave Hydra to manage templates, composition, and all the good things of non-structured config management. Upon resolution and merging of the config, we deserialize with Pydantic and get all the goodies of Pydantic validation, checks, etc.
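
Putting it together, here is a minimal end-to-end sketch of this pattern; the class names and conf/ layout are illustrative rather than the exact sample code:

import hydra
from omegaconf import DictConfig, OmegaConf
from pydantic import BaseModel, conint


class Model(BaseModel):
    type: str
    num_layers: conint(ge=3)


class Config(BaseModel):
    model: Model


@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # Hydra/OmegaConf handle composition, overrides, and interpolation...
    OmegaConf.resolve(cfg)
    # ...then Pydantic validates and types the merged result.
    config = Config(**OmegaConf.to_container(cfg))
    print(config.model.num_layers)


if __name__ == "__main__":
    my_app()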

How well does this integrate with other ecosystem tools

Given Hydra can provide the final configuration in dict format, the configuration side can easily integrate with ecosystem tools that care about configs. Given that most experiment management and tracking tools support managing config as a dict, this is less of a challenge. A post from Weights & Biases goes into detail about the integration.

It's interesting to note that other features of Hydra, like multirun and sweeps, employ multi-processing and so may interfere with other tools that also use multi-processing techniques.

Version config

Versioning is a very important practice to follow if one works with a large, actively evolving code base. For the same reasons that REST APIs are versioned, docker images are versioned, library releases are versioned, and so on, application config specifications should be versioned as well - if not for incremental updates, then at least for breaking changes. Kubernetes actually does this really well: users can bring their resource specification of a supported version, and it can work with that version, creating the resources for you. There is a lot more complexity in the config management that Kubernetes implements, but what's transferable here, in my view, is:

  1. When creating a run, always use the latest version of the config.
  2. Have support in place to migrate your old configs to the latest version of the config, so you can reproduce [not repeat] your experiments with newer features should you need to.

This prototype is a toy example exploring how to handle versioning in the config and provide a migration path when an old version of the config is provided. The assumption made here is that your training code only ever works with the latest version of the config, so the application code is simple and always progressive. Whenever a breaking change is introduced, a new float version of the config is created, while the preceding version of the config implements a migration path to the following version.

The code shown below represents the Model config, which has three versions v, i.e. 1.0, 2.0, and 3.0. Each version makes a breaking change to the previous one, but each concrete implementation provides a method def to_next(self) -> Optional[BaseModel] that attempts to migrate the config to the next specification.

class VersionModel(BaseModel, abc.ABC):
    v: Literal["inf"] = Field(float("inf"), const=True)

    class Config:
        title = "Model"
        allow_mutation = False
        max_anystr_length = 10
        validate_assignment = True
        anystr_lower = True
        validate_all = True
        use_enum_values = True

    def version(self) -> float:
        return self.v

    def _next_version_model(self) -> float:
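        # fake_registry (defined in the full sample, not shown here) is assumed to map
        # version numbers to model classes, e.g. {1.0: Model1, 2.0: Model2, 3.0: Model3}.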
        versions = list(fake_registry.keys())
        versions.sort()
        idx = versions.index(self.v) + 1
        if idx == len(versions):
            return None
        return versions[idx]

    def next_version(self) -> BaseModel:
        next_class = fake_registry.get(self._next_version_model())
        return next_class

    @abc.abstractmethod
    def to_next(self) -> BaseModel:
        pass


class Model1(VersionModel):
    # https://pydantic-docs.helpmanual.io/usage/types/#arguments-to-constr
    type: str  # constr(to_lower=True)
    num_layers: conint(ge=3)
    width: Optional[conint(ge=0)]
    # v: float = Field(1.0, const=True)
    zoom: conint(gt=18) = 19
    v: Literal[1.0] = Field(1.0, const=True)

    # @classmethod
    # def to_next(cls, instance: Type[VBT]) -> BaseModel:
    def to_next(self) -> Optional[BaseModel]:
        next_class = self.next_version()
        if next_class != Model2:
            raise Exception("WTH")

        logging.warning("================================================")
        logging.warning("==============Migrating from 1->2 ==============")
        logging.warning("================================================")

        d = self.dict()
        d.pop("v")
        return Model2(**d)


class Model2(Model1):
    width: conint(ge=5)
    context: conint(ge=256) = 256
    # v: float = Field(2.0, const=True)
    v: Literal[2.0] = Field(2.0, const=True)

    def to_next(self) -> Optional[BaseModel]:
        next_class = self.next_version()
        if next_class != Model3:
            raise Exception("WTH")

        logging.warning("================================================")
        logging.warning("==============Migrating from 2->3 ==============")
        logging.warning("================================================")

        return Model3(
            new_context=self.context,
            num_layers=self.num_layers,
            type=self.type,
            width=self.width,
            zoom=self.zoom,
        )

class Model3(Model1):
    new_context: conint(ge=512, le=1300) = 512
    v: Literal[3.0] = Field(3.0, const=True)

    def to_next(self) -> Optional[BaseModel]:
        logging.warning("================================================")
        logging.warning("============== Latest no migration =============")
        logging.warning("================================================")

        return None
Config templates, as shown here, are versioned as well.
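
Given these versioned models, a small helper (hypothetical, not part of the sample) can walk the migration chain until the latest version is reached:

def migrate_to_latest(model: VersionModel) -> VersionModel:
    # Repeatedly apply to_next() until a model reports no further migration.
    current = model
    while True:
        migrated = current.to_next()
        if migrated is None:
            return current
        current = migrated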

Let's run through various scenarios:

1. A user wants to run a training with the latest config; they run python example.py model=resnet_v3.
2. A user wants to run a training with the latest config but override some values; they run python example.py model=resnet_v3 +model.zoom=25.
3. A user wants to re-run a training from an old config; they run python example.py. This uses the v2 version of the config, and migration fails because context (mapped to new_context in v3) must be at least 512, but the v2 template sets it to 256. The user gets feedback on what needs to change in the config to make it work.
4. The same user as in #3 wants to re-run a training from the old config; they run python example.py model.context=512. Now the migration succeeds and the application only ever sees the Model3 version of the config object.
5. I want to expose the schema of the Model config. How can I do this?

from typing import Annotated, Union
from pydantic import Field, schema_json_of

_Models = Annotated[Union[Model1, Model2, Model3], Field(discriminator="v")]
_schema = schema_json_of(_Models, title="Model Schema", indent=2)
This produces the following schema for model config:
{
  "title": "Model Schema",
  "discriminator": {
    "propertyName": "v",
    "mapping": {
      "1.0": "#/definitions/Model1",
      "2.0": "#/definitions/Model2",
      "3.0": "#/definitions/Model3"
    }
  },
  "anyOf": [
    {
      "$ref": "#/definitions/Model1"
    },
    {
      "$ref": "#/definitions/Model2"
    },
    {
      "$ref": "#/definitions/Model3"
    }
  ],
  "definitions": {
    "Model1": {
      "title": "Model",
      "type": "object",
      "properties": {
        "v": {
          "title": "V",
          "const": 1.0,
          "enum": [
            1.0
          ],
          "type": "number"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "num_layers": {
          "title": "Num Layers",
          "minimum": 3,
          "type": "integer"
        },
        "width": {
          "title": "Width",
          "minimum": 0,
          "type": "integer"
        },
        "zoom": {
          "title": "Zoom",
          "default": 19,
          "exclusiveMinimum": 18,
          "type": "integer"
        }
      },
      "required": [
        "type",
        "num_layers"
      ]
    },
    "Model2": {
      "title": "Model",
      "type": "object",
      "properties": {
        "v": {
          "title": "V",
          "const": 2.0,
          "enum": [
            2.0
          ],
          "type": "number"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "num_layers": {
          "title": "Num Layers",
          "minimum": 3,
          "type": "integer"
        },
        "width": {
          "title": "Width",
          "minimum": 5,
          "type": "integer"
        },
        "zoom": {
          "title": "Zoom",
          "default": 19,
          "exclusiveMinimum": 18,
          "type": "integer"
        },
        "context": {
          "title": "Context",
          "default": 256,
          "minimum": 256,
          "type": "integer"
        }
      },
      "required": [
        "type",
        "num_layers",
        "width"
      ]
    },
    "Model3": {
      "title": "Model",
      "type": "object",
      "properties": {
        "v": {
          "title": "V",
          "const": 3.0,
          "enum": [
            3.0
          ],
          "type": "number"
        },
        "type": {
          "title": "Type",
          "type": "string"
        },
        "num_layers": {
          "title": "Num Layers",
          "minimum": 3,
          "type": "integer"
        },
        "width": {
          "title": "Width",
          "minimum": 0,
          "type": "integer"
        },
        "zoom": {
          "title": "Zoom",
          "default": 19,
          "exclusiveMinimum": 18,
          "type": "integer"
        },
        "new_context": {
          "title": "New Context",
          "default": 512,
          "minimum": 512,
          "maximum": 1300,
          "type": "integer"
        }
      },
      "required": [
        "type",
        "num_layers"
      ]
    }
  }
}

Conclusion

It's great to have tools like Hydra and Pydantic to simplify config management and deal with validation efficiently. It would be even better if Hydra integrated nicely with Pydantic, but at least there is a path to using these two tools together.

Review and comparison of two manifold learning algorithms: t-SNE and UMAP

What are manifold learning algorithms? What are t-SNE and UMAP? What are the differences between t-SNE and UMAP? How can I explore the features of a dataset using t-SNE and UMAP? These are my notes from my recent exploration into t-SNE and UMAP, trying to apply them to a multi-label dataset to understand the abilities and limits of these algorithms.

This post is broken down into the following sections:

Manifold learning algorithms (MLA)

For us humans, high-dimensional data are very difficult to visualize and reason with. That's why we use dimensionality reduction techniques to reduce data dimensions, so that the data is easier to work with and reason about. Manifold learning algorithms (MLAs) are dimensionality reduction techniques that are sensitive to non-linear structures in data. The non-linearity is what sets manifold learning apart from popular linear dimensionality reduction techniques like Principal Component Analysis (PCA) or Independent Component Analysis (ICA). Non-linearity allows MLAs to retain complex and interesting properties of data that would otherwise be lost in linear reduction/projection. Because of this property, MLAs are very handy for analyzing data: reducing the data dimensions to 2D or 3D and visualizing and exploring them to find patterns in datasets.

t-SNE (t-Distributed Stochastic Neighbor Embedding) and Uniform Manifold Approximation and Projection (UMAP) are the two examples of MLAs that I will cover in this post. I will compare and contrast them and provide an intuition of how they work and how to choose one over the other.

Note that MLAs are themselves tweaks and generalizations of existing linear dimensionality reduction frameworks. Similar to linear dimensionality reduction techniques, MLAs are predominantly unsupervised, even though supervised variants exist. The scope of this post is unsupervised techniques, however.

So what does non-linearity buy us? The following shows the difference in reduction using PCA vs t-SNE, as shown in McInnes' excellent talk:

Comparison of t-SNE and UMAP on MNIST dataset. (Image from McInnes talk)

As we can see, PCA retains some structure. However, it is much more pronounced in t-SNE, and clusters are more clearly separated.

Neighbour graphs

t-SNE and UMAP are neighbor graph techniques that model data points as nodes, with weighted edges representing the distances between the nodes. Through various optimizations and iterations, this graph and its layout are tuned to best represent the data, as the distance is derived from the "closeness" of the features of the data itself. This graph is then projected onto a reduced-dimensional space, a.k.a. the embedded space. This is a very different technique from matrix factorization, as employed by PCA and ICA, for example.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE uses Gaussian joint probability measures to estimate the pairwise distances between the data points in the original dimension. Similarly, Student's t-distribution is used to estimate the pairwise distances between the data points in the embedded space (i.e. the lower, target dimension). t-SNE then uses gradient descent to minimize the Kullback-Leibler (KL) divergence between the two distributions in the original and embedded spaces.

Perplexity in t-SNE is effectively the number of nearest neighbors considered in producing the conditional probability measure. A larger perplexity may obscure small structures in the dataset, while a small perplexity will result in very localized output that ignores global information. Perplexity must be less than the size of the data (the number of data points); otherwise, we are looking at getting a blobby mass. The recommended range for perplexity lies between 5 and 50, with the more general rule being that the larger the dataset, the larger the perplexity.

Because of the use of KL divergence, t-SNE preserves the local structure of the original space; however, global structure preservation is not guaranteed. Having said that, when initialization with PCA is applied, the global structure is somewhat preserved.

Talking more about structures, the scale of distances between points in the embedded space is not uniform in t-SNE, as t-SNE uses varying distance scales. That's why it is recommended to explore the data under different configurations to tease out patterns in the dataset. The learning rate and the number of iterations are two additional parameters that help refine the descent to reveal structures in the dataset in the embedded space. As highlighted in this great distill article on t-SNE, more than one plot may be needed to understand the structures of the dataset.

Different patterns are revealed under different t-SNE configurations, as shown by distill article. (Image from distill).

t-SNE is known to be very slow, with the order of complexity given by O(dN^2), where d is the number of output dimensions and N is the number of samples. The Barnes-Hut variation of t-SNE improves the performance to O(dN log N); however, Barnes-Hut can only work with dense datasets and provides at most a 3D embedding space. The efficiency gain in Barnes-Hut comes from changes to the gradient calculation, which is done with O(N log N) complexity using approximation techniques that lead to about a 3% error in the nearest-neighbor calculation.

Because of these performance implications, a common recommendation is to use PCA to reduce the dimension before applying t-SNE. This should be considered very carefully, especially if the point of using t-SNE was to explore the non-linearity of the dataset: pre-processing with linear techniques like PCA will destroy non-linear structures if present.

Uniform Manifold Approximation and Projection UMAP

UMAP is based on pure combinatorial mathematics that is well covered in the paper; it is also well explained by the author, McInnes, in his talk, and the library documentation is pretty well written too. Similar to t-SNE, UMAP is a topological neighbor-graph modeling technique. There are several differences between t-SNE and UMAP, the main one being that UMAP retains not only the local but also the global structure of the data.

There is a great post that goes into detail about how UMAP works. At a high level, UMAP uses combinatorial topological modeling with the help of simplices to capture the data and applies Riemannian metrics to enforce uniformity in the distribution. Fuzzy logic is also applied to the graph to adjust the probability distance as the radius grows. Once the graphs are built, optimization techniques are applied to make the embedded-space graph very similar to the original-space one. UMAP uses binary cross-entropy as the cost function and stochastic gradient descent to iterate on the graph for the embedded space. Both t-SNE and UMAP use the same framework to achieve manifold projections; however, the implementation details vary. Oskolkov's post covers the nuances of both techniques in great detail and is an excellent read.

UMAP is faster for several reasons; mainly, it uses random projection trees and nearest-neighbor descent to find approximate neighbors quickly. As shown in the figure below, similar to t-SNE, UMAP also varies the distance density in the embedded space.

Manifold reprojection used by UMAP, as presented by McInnes in his talk. (Image from McInnes talk)

Here's an example of UMAP retaining both local and global structure in embedded space:

Example of UMAP reprojecting a point-cloud mammoth structure on 2-D space. (Image provided by the author, produced using tool 1)

Here's a side-by-side comparison of t-SNE and UMAP on reducing the dimensionality of a mammoth. As shown, UMAP retains the global structure but it's not that well retained by t-SNE.

Side by side comparison of t-SNE and UMAP projections of the mammoth data used in the previous figure. (Image provided by the author, produced using tool 1)

The explanation for this difference lies in the loss function. As shown in the following figure, UMAP uses binary cross-entropy, which penalizes both local (clumps) and global (gaps) structures. In t-SNE, however, due to KL divergence being the choice of cost function, the focus remains on getting the clumps, i.e. the local structure, right.

Cost function used in UMAP, as discussed in McInnes' talk. (Image from McInnes talk)

The first part of the equation is the same in both t-SNE (coming from KL divergence) and UMAP. Only UMAP has the second part, which contributes to getting the gaps right, i.e. getting the global structure right.

Comparison table: t-SNE vs UMAP

| Characteristic | t-SNE | UMAP |
| --- | --- | --- |
| Computational complexity | O(dN^2) (Barnes-Hut: O(dN log N)) | O(d*N^1.14) (empirical estimate: O(dN log N)) |
| Local structure preservation | Yes | Yes |
| Global structure preservation | No (somewhat, when init=PCA) | Yes |
| Cost function | KL divergence | Cross-entropy |
| Initialization | Random (PCA as alternative) | Graph Laplacian |
| Optimization algorithm | Gradient Descent (GD) | Stochastic Gradient Descent (SGD) |
| Distribution for modelling distance probabilities | Student's t-distribution | Family of curves 1/(1 + a*y^(2b)) |
| Nearest-neighbors hyperparameter | Perplexity (2^Shannon entropy) | Nearest neighbor k |

Exploring dataset with t-SNE & UMAP

Now that we have covered theoretical differences, let's apply these techniques to a few datasets and do a few side-by-side comparisons between t-SNE and UMAP.

Source code and other content used in this exercise are available in this git repository - feature_analysis. The notebook for MNIST analysis is available here. Likewise, the notebook for CIFAR analysis is available here.

Simple feature dataset like MNIST

The following figure reprojects MNIST image features onto the 2D embedded space under different perplexity settings. As shown, increasing the perplexity packs the local clusters more tightly together. In this case, the PCA-based initialization technique was chosen because I wanted to retain as much of the global structure as possible.

Reprojection of MNIST image features on the 2D embedded space using t-SNE under different perplexity settings. (Image provided by author)
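For reference, a perplexity sweep like the one behind this figure could be set up roughly as follows (a sketch only; the subsample size and perplexity values are assumptions, not the exact settings used for the figure):

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X, y = X[:5000], y[:5000]  # subsample to keep t-SNE runtime manageable

embeddings = {}
for perplexity in (5, 30, 50, 100):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, init="pca"
    ).fit_transform(X)
# `y` holds the digit labels and is used only to color the scatter plots.
```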

It's quite interesting to see which digits are packed close together:

  1. 7, 9, and 4
  2. 3, 5, and 8
  3. 6 and 0
  4. 1 and 2

It is also notable that 6 sits closer to 5 than to 1, and likewise 7 is closer to 1 than to 0.

At a high level this correlation makes sense: the features of digits that are quite similar are more closely packed than those of digits that are very different. It's also interesting to note that, in rare cases, 8s and 9s have anomalies that map closely to 0 in the embedded space. So what's going on? The following image overlays randomly selected images on the clusters produced by t-SNE at a perplexity of 50.

Reprojection of MNIST image features on the 2D embedded space using t-SNE @ perplexity=50 with randomly selected image overlay. (Image provided by author)

As we pan around this image, we can see a distinct shift in the characteristics of the digits. For example, the 2s at the top are very cursive and have a small round loop at the joint, whereas, as we travel to the lower part of the cluster of 2s, we can see how sharply written the bottom 2s are, with sharp angular joints. Likewise, the top part of the cluster of 1s is quite slanted whereas the bottom 1s are upright.

It's quite interesting to see that the 8s occasionally clustered together with 0s are quite round in the middle and do not have the sharp joint in the middle.

So, what does the MNIST data look like with UMAP? UMAP's embedded space reflects the same grouping discussed above. In fact, UMAP and t-SNE are very much alike in how the digits are grouped; they appear almost as mirror reflections of each other.

Reprojection of MNIST image features on the 2D embedded space using UMAP. (Image provided by author)

It's also very interesting to note how similar the 1s that are reprojected to the same coordinates in the embedded space look.

One example of samples that get reprojected to the same coordinates in the embedded space using UMAP. (Image provided by author)

Not all data points that collide in the embedded space will look exactly alike; the similarity is in the reduced-dimensional space. One such example is shown below, where a 1 and a 0 are reprojected to the same coordinates. As we can see, the strokes on the left side of the 0 are very similar to the strokes of the 1, and the circle of the zero is not quite complete either.

One example of two different digits getting reprojected to the same coordinates in the embedded space using UMAP. (Image provided by author)

Here's also an example of samples falling into the same neighborhood in the embedded space that look quite distinct despite sharing some commonality (the strokes around the mouth of 4 and incomplete 8s)!

Example of 4 and 8s reprojected to the nearby coordinates in the embedded space using UMAP. (Image provided by author)

It's unpredictable which tangible features are leveraged to calculate the similarities among data points in the embedded space. This is because the main focus of manifold learning algorithms is distance measures, and the embedded space is derived on a best-effort basis using unsupervised techniques, with inevitable information loss due to dimensionality reduction.

This was MNIST, where digits are captured on empty backgrounds. These are very easy cases, because all the signals in the feature vectors are true signals that correspond to the digit as it is drawn. When we start visualizing data where there is noise in the signals, things become more challenging. Take the CIFAR dataset, for example: the images are natural images captured with varying backgrounds, unlike MNIST with its plain black background. In the following section, let's have a look at what happens when we apply t-SNE and UMAP to the CIFAR dataset.

High-level differences between the CIFAR and MNIST datasets. (Image provided by author)

More complex datasets like CIFAR

The following figure shows the resultant embedding of CIFAR images after applying t-SNE. As shown below, the results are less than impressive: it is hard to delineate the CIFAR classes or perform any sort of feature analysis.

Results of CIFAR image feature visualization using t-SNE under different perplexity settings. (Image provided by author)

So, what's going on? Let's overlay the images and see if we can find some patterns and make sense of the one big lump we are seeing. The following figure overlays the images. It's really hard to find a consistent similarity between neighboring points. Often we see cars and vehicles nearby, but not consistently; they are intermixed with flowers and other classes. There is simply too much noise in the feature vectors for any meaningful convergence.

Results of CIFAR image feature visualization using t-SNE. Shows images in an overlay on randomly selected points. (Image provided by author)

In the above two figures, we looked at analyzing CIFAR with t-SNE. The following plot is produced by using UMAP. As we can see it's not convincing either. Much like t-SNE, UMAP is also providing one big lump and no meaningful insights.

Results of CIFAR image feature visualization using UMAP. (Image provided by author)

The following shows images of two cats that are projected to the same location in the embedded space. There is some similarity between the two images, such as the nose and the sharp ears, but the two images also have distinct features.

Results of CIFAR image feature visualization using UMAP, showing samples of cats that are reprojected to the same location in the embedded space. (Image provided by author)

Likewise, if we look at the following figure where deer and frog are co-located in embedded space, we can see the image texture is very similar. This texture however is the result of normalization and grayscale conversion. As we can see, a lot goes on in nature scenes and without a clear understanding of which features to focus on, one's features can be other's noise.

Results of CIFAR image feature visualization using UMAP. Shows images in an overlay on randomly selected points. (Image provided by author)

t-SNE and UMAP are feature visualization techniques and perform best when the data vector represents the feature sans noise.

What to do when there is noise in features?

So, what can we do if there is noise in our feature vectors? We can apply techniques that reduce noise in the feature vectors before applying manifold learning algorithms. Given the emphasis on non-linearity in both t-SNE and UMAP (to preserve non-linear features), it is better to choose a noise reduction technique that is itself non-linear.

An autoencoder is a class of unsupervised deep learning technique that learns a latent representation of the input dataset, eliminating noise. An autoencoder can be non-linear depending on the choice of layers in the network; for example, using convolutional layers allows for non-linearity. If noise is present in the feature vectors, an autoencoder can be applied to learn the latent features of the dataset and transform samples into noise-free samples before applying manifold algorithms. UMAP has native integration with TensorFlow for similar use cases, surfaced as parametric UMAP. Why parametric? Because autoencoders/neural networks are parametric: increasing the data size will not increase the number of parameters; the parameters may be numerous, but they are fixed and limited. This approach of transforming input feature vectors into a latent representation helps not only with noise reduction but also with complex and very high-dimensional feature vectors.
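As a minimal sketch of that idea (assuming a recent umap-learn installation with its TensorFlow-backed parametric mode available; the data here is a placeholder), parametric UMAP can be invoked much like the standard estimator:

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

X = np.random.rand(5000, 784).astype("float32")  # stand-in for raw feature vectors

# A neural network learns the mapping to the embedding; per the library docs,
# a custom Keras encoder (e.g. a convolutional one) can also be supplied.
embedder = ParametricUMAP(n_components=2)
X_2d = embedder.fit_transform(X)
print(X_2d.shape)  # (5000, 2)
```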

The following figure shows the results of applying an autoencoder before running the manifold algorithms t-SNE and UMAP for feature visualization. As we can see in the result, the clumps are much more compact and the gaps are wider. The proximity of the MNIST classes, however, remains unchanged, which is very nice to see.

Results of applying autoencoder on MNIST before applying manifold algorithm t-SNE and UMAP. (Image provided by author)

So how does this affect the features that contribute to the proximities/neighborhoods of the data? The manifold algorithm is still the same; however, it is now applied to the latent feature vectors produced by the autoencoder rather than the raw features. Effectively, the proximity factor is now calculated on the latent representation and not directly on perceptible features. Given that the digit clusters still hold the global structure and are just more tightly packed within classes, we can get a sense that it's doing the right thing if the intra-class clumps can be explained. Let's look at some examples. The following shows a reprojection in the embedded space where 4s, 9s, and 1s are clustered together into the larger cluster of 1s. If we look closely, the backbone of all these digits slants at about 45 degrees, and perhaps that has been the main driving factor. The protruding bellies of the 4s and 9s are largely ignored, but they are also not very prominent.

Example of co-located 1s, 4s, and 9s in the embedded space obtained by applying parametric UMAP on MNIST. (Image provided by author)

More importantly, looking at the dataset (in the following figure), there are not many 9s or 4s that have a backbone at that steep 45-degree slant. This is more easily seen in the full-scale overlay in the following figure (with sparsely chosen images to make it more comprehensible): all samples are upright 4s and 9s, and where there is a slant, the protrusion is more prominent. Anyhow, we don't need to go overboard with this, as manifold algorithms are not feature detection algorithms and certainly can't be compared to the likes of more powerful feature extraction techniques like convolution. The goal with manifolds is to find global and local structures in the dataset. These algorithms work best if the signals in the dataset are noise-free, where noise includes features/characteristics we want ignored in the analysis. A rich natural background is a very good example of such noise, as already shown in the CIFAR case.

Images overlaid on t-SNE of auto-encoded features derived from MNIST dataset. (Image provided by author)

How to do this for multi-label datasets

In the early days of learning about this, I found myself wondering how we would do feature analysis on a multi-label dataset. Multi-class is certainly the easy case: each sample only needs to be color-coded for one class, so it is easy to visualize. We also expect class-specific data to mostly clump together (perhaps not always, and it depends on intent, but that is the more common case).

If the multi-label data consists of exclusive classes, similar to the Stanford Cars Dataset where the options within the make and the model of cars are mutually exclusive, then splitting the visualization into groups of exclusive cases can be very helpful.

However, if the multi-label dataset is more like MLRSNet, where classes are independent, then it's best to first analyze the data class-agnostically, explore whether there are any patterns in the features, and proceed based on that.

Can we apply this to understand what neural networks are doing?

A lot of work has been done in the area of explainability and feature understanding, and it is very well documented in the Distill blogs. The underlying idea is that we can take the activations of a layer of the neural network and explore which features that particular layer is paying attention to. The activations are essentially the signals that fire for a given input to the layer. These signals then form the feature vector for further analysis to understand where and what the layer is paying more attention to. t-SNE and UMAP are heavily used in these analyses.
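As an illustrative sketch of this workflow (assuming a recent torchvision model and a placeholder batch of images; none of this is from the Distill work itself), one might capture a layer's activations with a forward hook and project them with UMAP:

```python
import torch
import torchvision.models as models
import umap

model = models.resnet18(weights="IMAGENET1K_V1").eval()
activations = []

def hook(module, inputs, output):
    # Flatten spatial dimensions so each sample becomes a single feature vector.
    activations.append(output.flatten(start_dim=1).detach())

handle = model.layer4.register_forward_hook(hook)

images = torch.randn(256, 3, 224, 224)  # placeholder batch of images
with torch.no_grad():
    model(images)
handle.remove()

features = torch.cat(activations).numpy()
embedding = umap.UMAP(n_components=2).fit_transform(features)
```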

The Distill blogs are very well documented and highly recommended reading if this topic is of interest to you.

Conclusion

This post focused on the fundamentals of manifold learning algorithms and dove into the details of t-SNE and UMAP. It also compared and contrasted t-SNE and UMAP and presented some analysis of the MNIST and CIFAR datasets. We also covered what to do if we have a very high-dimensional dataset or noise in the dataset. Lastly, we touched on what to do if your dataset is multi-label.

In the follow-up, I will cover how we can utilize t-SNE and UMAP to better understand what neural networks are doing and apply it in conjunction with convolutions as feature extractors.

[umap_doco] https://umap-learn.readthedocs.io/en/latest/how_umap_works.html

Confident Learning: Label errors are imperative! So what can you do?

What makes deep-learning so great, despite what you may have heard, is data! There is an old saying, that sums it up pretty well:

The model is only as good as the data!

Which brings us to the real question: exactly how good is your data? Collecting ground-truth/training datasets is incredibly expensive, laborious, and time-consuming. The process of labeling involves searching for the object of interest and applying prior knowledge and heuristics to come to a final decision: a decision on whether the object (of interest) is present and, if so, annotating it to provide extra information like a bounding box or a segmentation mask.

There are several factors at play in this process that lead to errors in the dataset. The question is: what can we do about it? In this post, I will touch on some of the reasons why labeling errors occur, why errors in labels are imperative, and which tools and techniques one can use to manage errors in label datasets. Then, I will dive into the details of confident learning. I will also demonstrate the use of cleanlab, a confident learning implementation [1], to easily find noise in the data.

Disclaimer: Before diving into the details, I would like to acknowledge that, at the time of writing this post, I am not affiliated with or sponsored by cleanlab in any capacity.


Reasons for labeling errors

Let's first look at some of the reasons why errors may be present in the labels. One broad class of such errors is tooling/software bugs. These can be controlled and managed using good software practices, like tests covering both the software and the data. The other class of errors is the one where the mistakes come from the labelers themselves. These are much harder issues to track, because labelers are the oracle in this deep learning process, after all! If we don't trust them, then whom do we trust?

There is very interesting work by Rebecca Crowley in which she provides a detailed chart of the reasons why an object (of interest) may be missed while searching a scene, or why a wrong final decision may be made [4]. Some that directly impact labeling, in my view, are:

  1. Search Satisficing: The tendency to call off a search once something has been found, leading to premature stopping and thus increasing the chance of missing annotations [4]. This applies more to scenarios where more than one annotation is needed, for example multi-label or segmentation annotation (a dog and a pen are in the image, but the labeler only annotates the dog and does not spend enough time to spot the pen and capture it in the labels).
  2. Overconfidence & Under-confidence: This type of labeling error relates to one's feeling-of-knowing [4] being over- or underestimated.
  3. Availability: There is an implicit bias toward annotating something wrongly if it occurs very frequently or very rarely [4]. This is particularly true for challenging labeling tasks. For instance, if the cancer prevalence rate in a location is 0.01%, then the labeler, when labeling a not-so-straightforward case, is more likely to mark non-cancer than cancer.
  4. Anchoring/Confirmation bias: When a labeler makes a pre-emptive decision [4] about a labeling task outcome and then looks for information to support that decision. For example, believing they are looking at a cancerous case, they start to search for abnormalities in the image to support the finding that the case is cancer. In this unfair search/decision process, they are more likely to make mistakes.
  5. Gambler's Fallacy: When labelers encounter a repeated pattern of similar cases, they are likely to deviate and favor an outcome that breaks the pattern [4].
  6. Amongst all these, cognitive overload is also a valid and fair reason for errors in labels.

Labeling efficiency tricks that increase the risk of labeling errors

Given that the process of procuring a labeled dataset is expensive, some clever tricks and techniques are occasionally applied to optimize the labeling funnel. While some focus on optimizing the labeling experience, such as Fast Interactive Object Annotation [5], other techniques focus on auto-labeling to reduce the labeling burden and aid the labelers. Tesla has a very powerful mechanism for auto-labeling at scale, as discussed at CVPR 2021 by Karpathy. They certainly have the advantage of feedback (not just which events are worth labeling but also what the driver did and whether it led to a mishap). It's an advantage that not all deep-learning practicing organizations have. Impressive nonetheless! Then we also have the weakly supervised class of training regimes, which I won't go into detail about (perhaps a topic for another day)!

The thing is, the cleverer the tricks you employ to optimize this process, the higher the chances of error in your dataset. For instance, model-in-the-loop labeling is increasingly used to optimize the labeling process (as detailed in this presentation) as an auto-labeling/pre-labeling trick, wherein predicted labels are shown to the labelers and the labeler only fine-tunes the annotation instead of annotating from scratch. This process has two main challenges:

a) It may introduce a whole gamut of biases in the labels, as discussed above in Reasons for labeling errors.

b) If the model is not on par with human labelers, the labor and boredom of correcting garbage pre-labels increase the risk of errors in the dataset. This is a classic case of cognitive overload.

If this example was not enough, let's take the case of auto-labeling using an ontology/knowledge graph. Here, the risk of error propagation is very high if the knowledge encoded in the ontology/knowledge graph is biased. For example, "if it's an ocean water body, it can't be a swimming pool". Except that, contrary to common knowledge, ocean pools do exist. Or "if it's a house, it's not a waterbody". Except that, you know, lake houses do exist!

Ocean Pool/Rock Pool @Mona Vale NSW AU (Image is taken from the internet! @credit: unknown)

Errors in labels are imperative

Given the challenges discussed so far, it is fair to say that errors are imperative. The oracles in this process are the labelers, and they are only human!

I am not perfect; I am only human

Evidently, there is an estimated average of at least 3.3% errors across 10 popular datasets; for example, label errors comprise at least 6% of the ImageNet validation set 2.

Assuming human labelers will produce a perfectly clean dataset is, well, overreaching to say the least. If one cares, one can employ multiple labelers to reduce the error by removing noise through consensus. This is a great technique to produce a high-quality dataset, but it is also many times more expensive and slower, and thus impractical as the standard modus operandi. Besides being expensive, it does not guarantee a clean dataset either. This is evident from Northcutt's NeurIPS 2021 work [2] analyzing errors in the test sets of popular datasets, which reported on the order of a hundred samples across popular datasets where agreement on the true ground truth could not be reached despite collating outcomes from multiple labelers (see table 2 in the paper for reference).

For the last few years, I have been using model-in-the-loop approaches to find samples where the dataset and the models disagree. Another technique that I have found useful is leveraging loss functions, as called out by Andrej Karpathy! More recently, I have seen a huge benefit from deploying ontology-based violations to find samples that are either a) labeling errors or b) extreme edge cases that we were not aware of (aka out-of-distribution [OOD] samples).


However, I realized I had been living in oblivion when I came across the cleanlab project from Northcutt's group and started digging into a bunch of literature around this exact space! I was excited to uncover all the literature that proposed the tricks and tips I had been using to date without being aware of it. Well, no more! In the following sections, I will cover what I have learned reading through this literature.

Exactly, what is Confident learning?

All models are wrong, but some are useful!

Confident learning[1] (CL) is all about using all the useful information we have at hand to find noise in the dataset and improve its quality. It's about taking one oracle (the labelers) and testing it using another oracle in the build, i.e. the model! At a very high level, this is exactly what we have been more generally calling model-in-the-loop.

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality!1

Confident learning (CL) is a class of learning where the focus is to learn well despite some noise in the dataset. This is achieved by accurately and directly characterizing the uncertainty of label noise in the data. The foundation CL depends on is that label noise is class-conditional, depending only on the latent true class, not the data 1. For instance, a leopard is likely to be mistakenly labeled as a jaguar. This is a very strong assumption and, in practice, untrue. One can argue that the scene (i.e. the data) has implications for the labeler's decisions. For instance, a dinghy sailing on water is less likely to be labeled as a car than a dinghy being transported on a lorry. In other words, wouldn't a dinghy being transported on a lorry be more likely to be labeled as a car than one sailing on the water? So data and context do matter! This assumption is a classic case of "in statistics, any assumption is fair if you can solve for x". Now that we have taken a dig at this, it's fair to say that the assumption is somewhat true, even if not entirely.

Another assumption CL makes is that the rate of error in the dataset is < 1/2. This is a fair assumption and comes from Angluin & Laird's [3] method for the class-conditional noise process, which is used in CL to calculate the joint distribution of noisy and true labels.

In confident learning, a reasonably performant model is used to estimate the errors in the dataset. First, the model's predictions are obtained; then, using class-specific thresholds, the confident joint matrix is computed, which is then normalized to obtain an estimate of the joint distribution of noisy and true labels. This estimated matrix then builds the foundation for pruning, counting, and ranking samples in the dataset.

Estimating the joint distribution is challenging as it requires disambiguation of epistemic uncertainty (model-predicted probabilities) from aleatoric uncertainty (noisy labels) but useful because its marginals yield important statistics used in the literature, including latent noise transition rates latent prior of uncorrupted labels, and inverse noise rates. 1.

Confident learning in pictures1.

The role of a class-specific threshold is useful to handle variations in model performance for each class. This technique is also helpful to handle collisions in predicted labels.
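In practice, this estimation is wrapped behind a small API in cleanlab. Here is a minimal sketch (assuming cleanlab 2.x; the labels and predicted probabilities below are random placeholders) of surfacing likely label issues:

```python
import numpy as np
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
N, K = 1000, 3
labels = rng.integers(0, K, size=N)             # given (possibly noisy) labels
pred_probs = rng.dirichlet(np.ones(K), size=N)  # stand-in for out-of-sample predictions

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # most suspicious samples first
)
print(issue_indices[:10])
```

Note that the predicted probabilities should be out-of-sample (e.g. obtained via cross-validation), so the model is not grading its own training labels.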

What's great about confident learning and cleanlab (its Python implementation) is that, while it does not propose any groundbreaking algorithm, it has done something that is rarely done: it brings together antecedent works into a technical framework so powerful that it rightfully questions the major datasets that shape the evolution of deep learning. Their work on the analysis of 10 popular datasets, including MNIST, is well appreciated. It is also a reminder that we are massively overfitting the entire deep learning landscape to datasets like MNIST and ImageNet, as they are pretty much must-use datasets to benchmark and qualify bigger and better algorithms, and they have at least 3.3% errors, if not 20% (as estimated by Northcutt's group)!

This approach was used in a side-by-side comparison where samples were pruned either randomly (shown in orange in the figure below) or more strategically via CL to remove noisy samples only (shown in blue). The accuracy results are shown below; CL does better than random pruning!

Borrowed from Confident learning1

The following are examples of some of the samples that CL flags as noisy/erroneous, including the edge cases where a) neither label was true, or b) either the CL suggestion or the provided label is correct, but it is unclear which of the two (non-agreement):

Examples of samples corrected using Confident learning2.

CL is not 100% accurate. At best, it is an effort to estimate noise using an epistemic uncertainty predictor (the model). Having said that, the examples it fails on are quite challenging as well.

Examples of samples Confident learning struggled with!2

Fundamentals

Let's look at the fundamentals CL builds on and dig a bit into the antecedent and related works that formulate the foundation upon which confident learning (cleanlab) is built!

  1. As discussed above already, class-conditional noise as proposed by Angluin & Laird is the main concept utilized by CL.

  2. The use of weighted loss functions to estimate the probability of misclassification, as used in 7, is quite relevant to the field of CL, although it is not directly used in the technique proposed by the confident learning framework paper.

  3. The use of iterative noise cross-validation (INCV) techniques, as proposed by Chen et al. [8], which utilize a model-in-the-loop approach to find noisy samples. The algorithm is shown below:

    Chen's INCV algorithm

  4. CL does not directly use MentorNet[6]. However, it is very relatable work that builds on a data-driven training approach. In simplistic terms, a teacher network builds a curriculum of easy-to-hard samples (derived using loss measures) and the student is trained on this curriculum.

    MentorNet architecture

  5. There are other variations of data-driven teaching; one noticeable example is Co-teaching[9], in which a pair of networks teach each other, passing the non-noisy samples (selected based on loss measures) to each other during learning. The main issue Co-teaching tries to solve is the memorization effect that is present in MentorNet[6] (aka M-Net) but not in Co-teaching[9], thanks to the data sharing.

    Co-teaching approach in contrast with MentorNet aka M-Net

  6. The understanding that small losses are a good indicator of useful and very likely correctly labeled samples, whereas jumpy, high-loss-producing samples are, well, interesting! They can be extreme edge cases, out-of-distribution samples, or can also indicate wrongness. This only holds for reasonably performant models with abilities at least better than random chance.

Limitations of CL

One thing that CL entirely ignores about label errors is the case where one or more true labels are missed entirely. This is more likely to happen in multi-label settings, and even more so in complex labeling settings like segmentation or detection, than in multi-class classification. For example, there is a TV and a water bottle in an image, the only annotation present is for the TV, and the water bottle is missed entirely! This is currently not modeled in CL, as CL relies on building a pairwise class-conditional distribution between the given label and the true label (if the given label was cat but the actual label was dog, for example). The framework itself does not allow modeling a missing label when a true label is present.

The pairwise class-conditional distribution proposed in CL also limits its use in multi-label settings, specifically when multiple labels can co-exist on the same data. For example, both a roof and a swimming pool can be present in an image, and they are not necessarily exclusive. This limitation comes from the pairwise (single class given vs single class predicted) modeling. This is different from, say, the Stanford Cars Dataset, where make and model are predicted as multiple labels but are exclusive within their groups: a vehicle can be a ute or a hatchback, and independently a Toyota or a Volvo, but the options within each group are mutually exclusive. Exclusivity in this case allows the pairwise class-conditional distribution to be modeled. Such problems are better described as multi-label, multi-class problems modeled using multi-headed networks. True multi-label datasets can't be modeled with such pairwise joint distributions. They could perhaps be modeled using more complex joint distribution formulations, but that would not scale very well; confident learning[1], as it currently stands, is already a computationally expensive algorithm with complexity O(m^2 + nm).

Having said this, it is probably something that is just not explained in the literature, as cleanlab itself seems to support multi-label classification. Conceptually, I am unclear how multi-label works given the said theory. The support for multi-label is developing as issues are being addressed in the tool 1, 2.

Hands-on label noise analysis using cleanlab for multi-label classification

Now that we have covered the background and theoretical foundations, let's try out confident learning for detecting label noise using its implementation, cleanlab. Specifically, I will use multi-label classification, given the uncertainty around it (as discussed above). For this hands-on, I will use the MLRSNet dataset. This spike is built using a ResNet multi-label image classifier predicting 6 classes that can co-exist at the same time. These classes are ['airplane', 'airport', 'buildings', 'cars', 'runway', 'trees'], derived from an MLRSNet subset, i.e. only using airplane and airport.

The source code and the notebook for this project are in the label-noise GitHub repository. The notebook checks how well cleanlab performs in detecting label noise and identifying out-of-distribution samples, weird samples, and also errors!

Model info

Train and validation loss from the model trained using label-noise for 19 epochs. Note that for this spike I opted out of n-fold cross-validation (CV). Using CV can lead to better results, but it is not quick.

Training logs for the model used in this exercise (Image provided by author)

Exploration in noise using cleanlab

As shown in this notebook, the filter approach provided a list of samples that were deemed noisy. It flagged 38% of the out-of-sample data as noisy/erroneous.

Ranking provides an ordering of the samples on a 0-1 scale of label quality: the higher the number, the better the quality of the sample. I looked into the get_self_confidence_for_each_label approach for out-of-distribution (OOD) samples and also an entropy-based ordering. The distribution for each is as follows:

Confidence order from entropy-based ordering

Confidence order from self_confidence ordering
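For completeness, here is a small sketch (assuming cleanlab 2.x and a single-label view of the data; inputs are random placeholders) of the self-confidence ranking referred to above:

```python
import numpy as np
from cleanlab.rank import get_self_confidence_for_each_label

rng = np.random.default_rng(0)
N, K = 1000, 6
labels = rng.integers(0, K, size=N)             # given labels
pred_probs = rng.dirichlet(np.ones(K), size=N)  # stand-in out-of-sample predictions

# Self-confidence is the predicted probability of the given label; low means suspicious.
quality = get_self_confidence_for_each_label(labels, pred_probs)
worst_first = np.argsort(quality)               # lowest-quality samples first
print(worst_first[:10])
```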

The samples picked as lowest quality or confidence were indeed rightfully chosen. There is an airplane that was missed in the image (shown in the highlighted overlay), and another sample is OOD!

What's more interesting is that all three methods (filtering, ranking by self-confidence, and ranking by entropy) flagged these two samples! So we increase the threshold, and while there are false positives (for noise), there are some good examples of errors.

False Positive True Positive

I am not sure I follow how to interpret the pairwise count for multi-label, so that's left for another day!

Conclusion

As discussed in this post, there are several reasons why label errors are unavoidable. While no small effort is required to minimize the errors in a dataset, managing the errors that remain is also warranted: finding noisy data, out-of-distribution data, or data that represents a systematic flaw (software/tooling issues or issues in the understanding of a concept that defines the target class). Approaches like model-in-the-loop, or additional information like an ontology, are effective techniques for finding such noise or errors in the dataset. Confident learning provides a solid foundation for analyzing a dataset for noisy or OOD samples, a technique that is quite effective for multi-class problems, with support for multi-label classification still evolving. Now, on to cleaning the dataset! Happy cleaning!

References

  1. C. G. Northcutt, L. Jiang, and I. Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021. 1911.00068
  2. Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. In International Conference on Learning Representations Workshop Track (ICLR). 2103.14749
  3. D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988. link
  4. Crowley RS, Legowski E, Medvedeva O, Reitmeyer K, Tseytlin E, Castine M, Jukic D, Mello-Thoms C. Automated detection of heuristics and biases among pathologists in a computer-based system. Adv Health Sci Educ Theory Pract. 2013 Aug;18(3):343-63. doi: 10.1007/s10459-012-9374-z. Epub 2012 May 23. PMID: 22618855; PMCID: PMC3728442. link
  5. Huan Ling and Jun Gao and Amlan Kar and Wenzheng Chen and Sanja Fidler, Fast Interactive Object Annotation with Curve-GCN. CVPR, 2019 link
  6. Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. (2018). Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML).1712.05055.
  7. N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Conference on Neural Information Processing Systems (NeurIPS), pages 1196–1204, 2013. NeurIPS 2013
  8. P. Chen, B. B. Liao, G. Chen, and S. Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning (ICML), 2019.1905.05040
  9. Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Conference on Neural Information Processing Systems (NeurIPS). 1804.06872

Review of recent advances in dealing with data size challenges in Deep Learning

The energy and excitement in the machine learning and deep learning communities are infectious these days. So many groundbreaking advances are happening in this area, but I have often found myself wondering why the one thing that makes it all shine, yes, I am talking about the dark horse of deep learning, the data, is so underappreciated. The last few years of DL research have given me much joy and excitement, and I now carry hope that going forward we will see some exciting progress in this space that explores advances in deep learning in conjunction with data! In this article, I summarise some of the recent developments in the deep learning space that I have been blown away by.

Table of contents of this article:

- The dark horse of deep learning: data
- Labelled data: the types of labels
- Commonly used DL techniques centered around data
  - Data Augmentation
  - Transfer Learning
  - Dimensionality reduction
  - Active learning
- Challenges in scaling dataset for deep-learning
- Recent advances in data-related techniques
  - 1. Regularization
    - 1.1 Mixup
    - 1.2 Label Smoothing
  - 2. Compression
    - 2.1. X-shot learning: How many are enough?
    - 2.2. Pruning
      - 2.2.1 Coresets
      - 2.2.2 Example forgetting
      - 2.2.3 Using Gradient norms
    - 2.3. Distillation
  - 3. So what if you have noisy data
- Conclusion
- References

The dark horse of deep learning: data

Deep learning (DL) algorithms learn to perform a task by building a (domain) knowledge representation by looking at the training data. An early study of image models (classification and segmentation, from 2017) noted that the performance of the model increases logarithmically as the training dataset grows 1. The belief that increasing the training dataset size will continue to increase model performance has long been held. It has also been supported by another empirical study that validated this belief across machine translation, language modeling, image classification, and speech recognition 2 (see figure 1).

*Figure 1: Shows relationship between generalization error and dataset size (log scale) 2 *

So, a bigger dataset is better, right? Almost! A theoretical foundation has been laid out in the form of a power law, i.e. $ \begin{equation} \label{power_law} \varepsilon(m) \approx \alpha m^{\beta_g} \end{equation} $ wherein ε is the generalization error, m is the number of samples in the training set, α is a constant property of the problem/DL task, and β_g is the scaling exponent that defines the steepness of the learning curve. Here, β_g depicts how well a model can learn from adding more data to the training set 2 (see figure 2). Empirically, β_g was found to be between −0.07 and −0.35, despite theory suggesting a magnitude of 0.5 or 1 for β_g. Nonetheless, the logarithmic relationship holds. As shown in figure 2, the gain eventually tapers off into irreducible error.

*Figure 2: Power-law curve showing a model trained with a small dataset being only as good as random guesses, then rapidly improving as dataset size increases, and eventually settling into the irreducible error region, explained by a combination of factors including imperfect data that cause imperfect generalization 2 *
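To see what the power law implies in practice, here is a tiny numeric illustration with assumed constants (alpha and beta_g below are made up, chosen within the empirically reported range):

```python
alpha, beta_g = 2.0, -0.3   # assumed constants; empirically beta_g lies in [-0.35, -0.07]

def generalization_error(m: int) -> float:
    # Power-law relationship between dataset size m and generalization error.
    return alpha * m ** beta_g

for m in (10_000, 100_000, 1_000_000):
    print(m, round(generalization_error(m), 4))
# Each 10x increase in data multiplies the error by 10**beta_g (about 0.5 here),
# i.e. returns diminish on a log scale until irreducible error dominates.
```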

This can be attributed to several factors, including imperfections in the data. The importance of data quality, and of continually iterating over it, has been touched on in some previous talks 1, 2, 3. Data quality matters, and so does the data distribution: the better the distribution of the training dataset, the more generalized the model can be!


Data is certainly the new oil! 3


So, can we scale the data size without many grievances? Keep in mind that 61% of AI-practicing organizations already find data and data-related challenges to be their top challenge 4. If the challenges around procurement, storage, data quality, and the distribution/demographics of the dataset have not subsumed you yet, this post focuses on yet another series of questions: how can we train efficiently when data volume grows and the computation cost and turnaround time increase linearly with data growth? We then begin asking how much of the data is superfluous, which examples are more impactful, and how we can find them. These are very important questions to ask, given that a recent survey 4 noted that about 40% of organizations practicing AI already spend at least $1M per annum on GPUs and AI-related compute infrastructure. This should concern us all. Not every organization beyond the FAANG companies (and also the ones assumed to be in FAANG but missed out on the acronym!) and those with big fat balance sheets will be able to leverage the gains of simply scaling the dataset. Besides, this should concern us all for environmental reasons and the implications for carbon emissions (more details).

The carbon footprint of training a single AI is as much as 284 tonnes of carbon dioxide equivalent — five times the lifetime emissions of an average car source.

The utopian state of simply scaling training datasets and counting your blessings simply does not exist. The question, then, is: what are we doing about it? Unfortunately, not a whole lot, especially if you look at the excitement in the ML research community in utilizing gazillions of GPU-years to gain a minuscule increase in model performance attributed to algorithmic or architectural changes. But the good news is that this area is gaining much more traction now. A few pieces of research since 2020 are very promising, albeit in their infancy. I have been following the literature around the use of data in AI (a.k.a. data-centric AI) very closely, as this is one of my active areas of interest, and I am excited about some of the recent developments. In this post, I will cover some of my learnings and excitement around this topic.

Before covering them in detail, let's first review some foundational understanding and priors:

Labelled data: the types of labels

This post focuses heavily on supervised learning scenarios, mainly in computer vision. In this space, there are two types of labels:

- Hard labels
- Soft labels

Traditional labels are hard labels, where the ground-truth value is discrete, e.g. 0 and 1: 0 for no and 1 for yes. These discrete values can be anything, depending on how the dataset was curated. It's important to note that these values are absolute and unambiguously indicate their meaning.

There is an emerging form of labels known as soft labels, where the ground truth represents a likelihood and is therefore continuous in nature. For example, a pixel is 40% cat and 60% dog. This will make a whole lot more sense in the following sections.

Commonly used DL techniques centered around data

Data augmentation and transfer learning are two commonly used techniques in deep learning these days that focus on applying data efficiently. Both these techniques are heavily democratized now and commonly applied unless explicitly omitted.

Data Augmentation

Data augmentation encompasses a variety of techniques to transform a data point such that the transformation adds variety to the dataset. The technique aims to keep the data distribution about the same while adding richness to the dataset. Predominantly, the transformations in this technique have been intra-sample: affine transformations, contrast adjustment, jittering, and color balancing are some examples. imgaug and kornia are very good libraries for such operations, even though all ML frameworks offer a limited set of data augmentation routines.

Data augmentation was initially proposed to increase robustness and achieve better generalization in the model, but it is also used to synthetically increase data size. This is especially true in cases where data procurement is really challenging. These days, data augmentation techniques have become a lot more complex and richer, including scenarios wherein model-driven augmentations may be applied; one example is GAN-based techniques to augment and synthesize samples. In fact, data augmentation is also one of the techniques used to build robustness against adversarial attacks.

* Example of augmentation src *
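For illustration, a typical intra-sample augmentation pipeline might look like the following sketch (using torchvision transforms as a stand-in; imgaug or kornia would be set up similarly):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),   # affine transformation
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # jitter
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# augmented = augment(pil_image)  # applied per sample at data-loading time
```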

Transfer Learning

Transfer learning is also a very well democratized technique; it stems from reusing learned representations in a new task when the problem domains of the two tasks are related. Transfer learning relaxes the assumption that the training data must be independent and identically distributed (i.i.d.) with the test data 5, allowing one to solve the problem of insufficient training data by bootstrapping model weights from another model trained on a different dataset.

*Figure 3: Training with and without transfer learning 6 *

With transfer learning, faster convergence can be achieved if there is an overlap between the tasks of the source and target model.
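A minimal sketch of this (assuming torchvision and a hypothetical 10-class target task) is to start from ImageNet weights and swap only the classification head:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")  # backbone pretrained on ImageNet

# Optionally freeze the backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)  # new head for the 10-class target task
```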

Dimensionality reduction

Dimension reduction techniques are also applied to large datasets:

These techniques fall into two categories:

  1. Ones that seek to preserve the pairwise distances amongst all the samples in the dataset. Principal component analysis (PCA) is a good example of this.
  2. Ones that preserve local distances over global distances. Techniques like uniform manifold approximation and projection (UMAP) 23 and t-distributed stochastic neighbor embedding (t-SNE) 24 fit in this category. UMAP arguably preserves more of the global structure and is algorithmically faster than t-SNE. Both t-SNE and UMAP use gradient descent to arrive at the optimal embeddings.

In DL, however, these techniques are mostly used to understand the data and for visualization purposes. UMAP and t-SNE do better at preserving global structure than other embedding algorithms, but they are still limited. This blog covers the topic in more detail.

Active learning

Active learning is a methodology wherein the training process proactively asks for labels on specific data. It is used more commonly in classical ML techniques but has not been very successful in deep learning due to back-propagation. Offline or pool-based active learning has been investigated heavily for use in deep learning, but without much groundbreaking success. The use of active learning is not very straightforward either, due to the negative impact of outliers on training 25. Pool-based active learning is covered in more detail in a following section (pruning).

Challenges in scaling dataset for deep-learning

Besides the techniques discussed in the previous section, not a lot of investment has been made in the area of data-centric AI. The momentum around data-centric AI has been building recently, with Andrew Ng driving data-centric AI efforts through his new startup Landing.AI.

In my view, the following are some of the broad categories of questions that fall under the purview of data-centric AI:

  1. How to train efficiently with a rapid increase in dataset size? Yann LeCun called out in his interview with Soumith Chintala during PyTorch Developer Day 2021 that a training time of more than 1 week should not be acceptable. This is a very good baseline for practical reasons, but if one does not have an enormous GPU fleet at their disposal, then this goalpost is very hard to reach given current DL practices. So, what else can be done to train efficiently with increased dataset size?
  2. Are all samples equally important? How important is a given sample in the dataset? Can we leverage the "importance factor" for good?
  3. What role does a sample play towards better generalization? Some samples carry redundant features, so how do we deduplicate the dataset when features, as in DL, are not explicit?
  4. Data size matters, but can we be strategic about what goes into the dataset?
    1. Doing this cleverly has to do with efficient sampling and data mining techniques. These are easily solved problems if and only if we know what our targets are. The challenge in DL, as I see it, is knowing what to look for in order to mine the best samples. This is not well understood.
  5. How can we leverage more innate DL techniques like objective functions, backpropagation, and gradient descent to build a slick and effective dataset that provides the highest return on investment?
  6. Noise in datasets is seen as evil. But is it always evil? How much noise can one live with? How do we quantify this?
  7. How much of a crime is it if data bleeds across the traditional train/validate/calibrate/test splits?
    1. What are the recommendations on data splits for cascade training scenarios?
  8. How fancy can one get with data augmentation before returns start to diminish?
  9. How do we optimize the use of data if continual learning is practiced?

If we look at humans as learning machines, they have infinite data at their disposal to learn from. Our system has evolved efficient strategies to parse through infinite data streams and select the samples we are interested in. How our vision system performs foveal fixation, utilizing saccadic eye movements to efficiently subsample interesting and useful data points, should be a good motivation. Sure, we fail sometimes; we fail to see the pen on the table even though it's right in front of us, for various reasons, but we get it right most of the time. Some concepts of Gestalt theory, a principle used to explain how people perceive visual components (as organized patterns instead of many disparate parts), are already applicable for better selection of data, even if machine models are stochastic parrots. According to this theory, the eight main factors listed below determine how the visual system automatically groups elements into patterns.

  1. Proximity: Tendency to perceive objects or shapes that are close to one another as forming a group.
  2. Similarity: Tendency to group objects if physical resemblance e.g. shape, pattern, color, etc. is present.
  3. Closure: Tendency to see complete figures/forms even if what is present in the image is incomplete.
  4. Symmetry: Tendency to ‘see’ objects as symmetrical and forming around a center point.
  5. Common fate: Tendency to associate similar movement as part of a common motion.
  6. Continuity: Tendency to perceive each object as a single uninterrupted, i.e. continuous, object.
  7. Good Gestalt: Tendency to group together if a regular, simple, and orderly pattern can be formed
  8. Past experience: Tendency to categorize objects according to past experience.

Of these, I argue, proximity, similarity, common fate, and past experience are relevant. I would even argue for the possibility of applying closure. Recent work by FAIR 22 shows that machine models can fill in the gaps and infer missing pieces correctly by applying minor changes to the commonly used autoencoder technique. The reason I bring this up with more excitement than GAN-based hallucination techniques is how easy it is to build and train compared to a GAN.

Masked-Autoencoders showing model inferring missing patches 22

It's been interesting to note that the recent advances toward dealing with the challenges of scaled data are largely inspired by already known deep-learning techniques, except that they are now applied through the lens of data: examples include pruning, compression, sampling strategies, and leveraging phenomena such as catastrophic forgetting and knowledge distillation.

| Technique | How it's presently utilized in model building | Proposed data-centric view |
| --- | --- | --- |
| Pruning | A specialized class of model compression techniques where low-magnitude weights are eliminated to reduce size and computational overhead. | Samples that don't contribute much to generalization are omitted from the training regime. |
| Compression | A broad range of model compression techniques to reduce size and computational overhead, including techniques like quantization wherein some amount of information loss is expected. | A broad range of data filtering and compression techniques to reduce size without compromising much on generalization. |
| Distillation | To extract the learned representation from a more complex model into a potentially smaller model. | To extract the knowledge present in a larger dataset into a smaller synthesized set. |
| Loss function | Also termed the objective function; one of the core concepts of DL that defines the problem statement. | As shown in 22, and more broadly, can be leveraged to fill in missing information in the data. |
| Regularization | One of the theoretical principles of DL, applied through techniques like BatchNorm and Dropout to avoid overfitting. | A variety of techniques to avoid overfitting applied with data in mind, e.g. Label Smoothing 7,10 |

*Table 1: Summary of techniques crossbred from core DL techniques into data-centric DL *

Let's dive into the details of how these classes of techniques are applied through the lens of data:

1. Regularization

1.1 Mixup

Mixup is a special form of data augmentation that looks beyond intra-sample modification and explores inter-sample modification. The idea with mixup is to linearly combine (as a convex combination) a pair of samples to produce a single new one.

$ \begin{equation} x\prime = \lambda x_i + (1 - \lambda) x_j, \ \text{where } \lambda \in [0,1] \text{ is drawn from a beta distribution, and } x_i, x_j \text{ are the input/source vectors} \end{equation} $

$ \begin{equation} y\prime = \lambda y_i + (1 - \lambda) y_j, \ \text{where } y_i, y_j \text{ are the one-hot label encodings} \end{equation} $

*Figure 4: Sample produced by applying mixup 7 on Oxford Pets dataset *

Mixup 7 in fact seeks to regularize the neural network to favor simple linear behavior in between training examples. As shown in fig. 5, mixup results in better models with fewer misses. It has been shown that mixup increases generalization, reduces the memorization of corrupt labels, and increases robustness towards adversarial examples 7,8.

Figure 5: Shows that using mixup 7, lower prediction error and smaller gradient norms are observed.

I see mixup not only as an augmentation and regularization technique but also as a data compression technique. Depending on how frequently (say α) mixup is applied, the dataset compression ratio (Cr) will be:

$ \begin{equation} C_r = 1 - α/2 \end{equation} $

If you have not noticed already, applying mixup converts labels to soft labels. The linear combination of discrete values results in a continuous label value, which explains the example discussed previously wherein a pixel is 40% cat and 60% dog (see fig. 5).
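A compact sketch of the mixup combination above (my own illustration, assuming a PyTorch batch with one-hot targets and the usual Beta-distributed mixing coefficient):

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    # Draw the mixing coefficient lambda from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))          # pair each sample with a random partner
    x_mix = lam * x + (1 - lam) * x[perm]     # convex combination of inputs
    y_mix = lam * y + (1 - lam) * y[perm]     # one-hot targets become soft labels
    return x_mix, y_mix

x = torch.randn(8, 3, 32, 32)                                            # toy image batch
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()  # one-hot labels
x_mix, y_mix = mixup(x, y)
```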

1.2 Label Smoothing

Label smoothing 10 is a form of regularization that smooths out the ground truth by a very small value epsilon ɛ. One motivation for this is, of course, better generalization and avoiding overfitting, while the other is to discourage the model from becoming overconfident. Both 8,10 have shown that label smoothing leads to better models.

$ \begin{equation} Q_{i} = \begin{cases} 1 - \epsilon & \text{if } i = k \\ \epsilon/K & \text{otherwise, where } K \text{ is the number of classes} \end{cases} \end{equation} $

Label smoothing, as indicated by the equation above, does not lead to any visible differences in the label data, as ɛ is really small. Applying mixup, however, visibly changes both the source (x) and the label (y).

Applying label-smoothing has no noticeable difference
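For reference, a small sketch of the smoothing equation above (with a hypothetical epsilon of 0.1):

```python
import torch

def smooth_labels(labels: torch.Tensor, num_classes: int, epsilon: float = 0.1):
    # Start from epsilon/K everywhere, then place 1 - epsilon on the given class,
    # following the equation above.
    q = torch.full((labels.size(0), num_classes), epsilon / num_classes)
    q[torch.arange(labels.size(0)), labels] = 1.0 - epsilon
    return q

labels = torch.tensor([0, 2, 1])
print(smooth_labels(labels, num_classes=3))
```

Recent PyTorch versions also expose this directly, via the label_smoothing argument of CrossEntropyLoss.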

2. Compression

Compression here refers to a broad range of data filtering and compression techniques that reduce size without compromising much on generalization. The following are some of the recent exciting developments on this front:

2.1. X-shot learning: How many are enough?

The troubles of high computational cost and long training times due to growing datasets have led to the development of few-shot training strategies. The intuition behind this approach is to take a model and guide it to learn to perform a new task by looking at only a few samples 11. The concept of transfer learning is implicitly applied in this approach. This line of investigation started with training new tasks using only a few (a handful of) samples and explored the extreme case of one-shot training, i.e. learning a new task from only one sample 12,13.

Recently, an interesting mega-extreme approach to shot-based learning has emerged: 'Less Than One'-Shot Learning, a.k.a. LO-shot learning 11. This approach utilizes the soft-label concept and seeks to merge hard-label N-class samples into M samples where M < N, hence the name less than one! LO-shot techniques are a form of data compression and may feel very similar to the mixup technique discussed earlier. However, contrary to mixup's convex combination of samples, LO-shot exploits a distance-weighted k-nearest neighbours technique to infer the soft labels. The algorithm, termed distance-weighted soft-label prototype k-nearest neighbours (SLaPkNN), essentially takes the sum of the label vectors of the k nearest prototypes to a target point x, with each prototype weighted inversely proportional to its distance from x. The following figure shows a 4-class dataset merged into 2 samples using SLaPkNN.

Figure LO-Shot: LO Shot splitting 4 class space into 2 points 11.
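The distance-weighting idea can be sketched in a few lines of numpy. This only illustrates soft-label inference as described above (an inverse-distance weighted sum of prototype label vectors); it is not the authors' SLaPkNN implementation, and all values are made up.

import numpy as np

def soft_label_knn(x, prototypes, proto_labels, k=2, eps=1e-8):
    d = np.linalg.norm(prototypes - x, axis=1)       # distance of x to every prototype
    nearest = np.argsort(d)[:k]                      # indices of the k nearest prototypes
    w = 1.0 / (d[nearest] + eps)                     # weights inversely proportional to distance
    soft = (w[:, None] * proto_labels[nearest]).sum(axis=0)
    return soft / soft.sum()                         # normalized soft label for x

# Two prototypes carrying soft labels over 4 classes (illustrative values).
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
proto_labels = np.array([[0.5, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.5, 0.5]])
print(soft_label_knn(np.array([0.25, 0.25]), prototypes, proto_labels))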

In my understanding, that is the main theoretical difference between the two techniques, the other difference being that mixup only merges two samples into one, combining them as λ and 1-λ with λ drawn from a beta distribution, whereas LO Shot is more versatile and can compress far more aggressively. I am not saying mixup can't be extended to be more multivariate, but the empirical behaviour of such an approach is unknown; whereas 11 has shown SLaPkNN can compress 3M − 2 classes into M samples at least.

The technical explanation for this along with code is available here.

2.2. Pruning

Pruning is a subclass of compression techniques wherein samples that are not really helpful or effective are dropped, while the selected samples are kept as-is without any loss of content. Following are some of the known dataset pruning techniques:

2.2.1 Coresets

Coreset selection pertains to subsampling a large dataset down to a smaller set that closely approximates the original. This is not a new technique and has been heavily explored using hand-engineered features and simpler models, like Markov models, to approximate the smaller set. It is not a DL-specific technique either and has its place in classical ML as well. An example could be using naïve Bayes to select coresets for more computationally expensive counterparts like decision trees.

In deep learning, using a similar concept, a lighter-weight DL model can be used as a proxy to select the approximate dataset 15. This is easier to achieve when continual learning is practiced; otherwise it can be a very expensive technique in itself, given the proxy model needs to be trained on the full dataset first. It becomes especially tricky when the proxy and target models are different, and also when the information in the dataset is not concentrated in a few samples but uniformly distributed over all of them. These are some of the reasons why this approach has not been very successful.

2.2.2 Example forgetting

An investigation 14 reported that some samples, once learned, are never forgotten and exhibit the same behavior across various training parameters and hyperparameters, whereas other classes of samples are forgotten. A forgetting event was defined as the model's prediction for a sample regressing in a subsequent epoch. Both quantitative and qualitative analysis (see fig 6 and 7) of such forgotten samples indicated that noisy labels and images with "uncommon", visually complicated features were the main reasons for example forgetting.

Figure 6: Algorithm to track forgotten samples 14.

Figure 7: An increasing fraction of noisy samples leads to more forgetting events 14.

An interesting observation from this study was that removing a large fraction of unforgotten samples still results in extremely competitive performance on the test set. The hypothesis formed was that unforgotten samples are not very informative, whereas forgotten samples are more informative and useful for training. In their case, the forgetting score stabilized after 75 epochs (using ResNet and CIFAR; the value will vary with model and data).

Perhaps a few samples are enough to tell that a cat has 4 legs, a small face, and pointy ears; what matters more is how different varieties of cats look, especially those that deviate from the norm, e.g. Sphynx cats.
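A minimal sketch of the bookkeeping behind forgetting events, as I read 14: per sample, count how often a correct prediction regresses to an incorrect one in a later evaluation. The class and variable names are mine, not the paper's.

import numpy as np

class ForgettingTracker:
    def __init__(self, num_samples):
        self.prev_correct = np.zeros(num_samples, dtype=bool)
        self.forgetting_events = np.zeros(num_samples, dtype=int)

    def update(self, sample_ids, preds, labels):
        correct = preds == labels
        regressed = self.prev_correct[sample_ids] & ~correct      # was right before, wrong now
        self.forgetting_events[sample_ids] += regressed.astype(int)
        self.prev_correct[sample_ids] = correct

# Illustrative use: call update() with sample ids, predictions, and labels after each epoch.
tracker = ForgettingTracker(num_samples=50000)
ids = np.arange(128)
tracker.update(ids, preds=np.random.randint(0, 10, 128), labels=np.random.randint(0, 10, 128))
unforgettable = np.where(tracker.forgetting_events == 0)[0]       # pruning candidates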

2.2.3 Using Gradient norms

Loss is an excellent measure for finding interesting samples in your dataset, whether they are poorly labeled or genuine outliers. This was highlighted by Andrej Karpathy as well:

When you sort your dataset descending by loss you are guaranteed to find something unexpected, strange, and helpful.

Personally, I have found loss a very good measure for finding poorly labeled samples. So the natural question is: "should we explore how we can use the loss as a measure to prune the dataset?". It was not until NeurIPS 2021 21 that this was properly investigated. This Stanford study looked into the initial loss gradient norm of individual training examples, averaged over several weight initializations, to prune the dataset for better generalization. This work is closely related to example forgetting, except that instead of a performance measure, the focus is on using local information early in training to prune the dataset.

This work proposes the GraNd score of a training sample (x, y) at time t, given by the L2 norm of the gradient of the loss computed on that sample, and also the expected loss L2 norm, termed EL2N (equations below). The reasoning is that samples with a small GraNd score have a bounded influence on learning how to classify the rest of the training data at a given training time. Empirically, the paper found that averaging the norms over multiple weight initializations results in scores that correlate with forgetting scores 14 and allows pruning a significant fraction of samples early in training. They were able to prune 50% of the samples from CIFAR-10 without affecting accuracy, while on the more challenging CIFAR-100 dataset, they pruned 25% of examples with only a 1% drop in accuracy.
$ \begin{equation} \chi_t(x, y) = \mathbb{E}_{w_t} \left\| g_t(x, y) \right\|_2 \tag*{GraNd} \end{equation} $

$ \begin{equation} \chi_t(x, y) = \mathbb{E} \left\| p(w_t, x) - y \right\|_2 \tag*{EL2N} \end{equation} $
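As I read the EL2N equation above, the score is the L2 norm of the error between the softmax output and the one-hot label, averaged over several independently initialized models early in training. A small numpy illustration (not the paper's code; shapes and the 50% pruning threshold are arbitrary):

import numpy as np

def el2n_scores(prob_runs, y_onehot):
    # prob_runs: (num_inits, num_samples, num_classes) softmax outputs from independent inits.
    errors = prob_runs - y_onehot[None, :, :]          # p(w_t, x) - y for every initialization
    per_init_norm = np.linalg.norm(errors, axis=-1)    # L2 norm per sample, per initialization
    return per_init_norm.mean(axis=0)                  # expectation over initializations

# Fake 10-class data: 5 initializations x 1000 samples; keep the hardest (highest-score) half.
probs = np.random.dirichlet(np.ones(10), size=(5, 1000))
labels = np.eye(10)[np.random.randint(0, 10, 1000)]
scores = el2n_scores(probs, labels)
keep = np.argsort(scores)[len(scores) // 2:]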

This is a particularly interesting approach and a big departure from other pruning strategies to date, which treated samples in the dataset independently. Dropping samples based on independent statistics provides a weaker theoretical guarantee of success, as DL is a non-convex problem 21. I am very curious to find out how mixup impacts the GraNd scores, given that it has been shown (see figure 5b) that using mixup leads to a smaller gradient norm (l1, albeit).

Results of pruning with GraNd and EL2N 21.

The results from this study are shown in the figure above. Noticeably, very aggressive pruning is not fruitful even with this approach, despite how well it does on the CIFAR-10 and CIFAR-100 datasets. Are we retaining the data distribution when we drop a large number of samples? Mostly not, and that is the only reasoning that makes sense. And we circle back to: how much pruning is enough? Is that network dependent, or more a property of the data and its distribution? This study 21 claims that GraNd and EL2N scores, when averaged over multiple initializations or training trajectories, remove the dependence on specific weights/networks, presenting a more compressed dataset. If this assertion holds in reality, in my view, this is a very promising finding easing the data-related challenges of DL.

What's more fascinating about this work is that it sheds light on how the underlying data distribution shapes the training dynamics. This has been amiss until now. Of particular interest is identifying subspaces of the model’s data representation that are relatively stable over the training.

2.3. Distillation

Distillation refers to methodologies for distilling the knowledge of a complex or larger set into a smaller one. Knowledge or model distillation is a popular technique that compresses the learned representation of a larger model into a much smaller model without any significant drop in performance. Student-teacher training regimes have been explored extensively, even for transformer networks, which are even more data-hungry than more conventional networks such as convolutional networks, e.g. DeiT. Despite being called data-efficient, that paper employs a teacher-student strategy for transformer networks, and data itself is merely treated as a commodity.

Recently, this concept has been investigated for dataset distillation, with the aim of synthesizing an optimal smaller dataset from a large dataset 17,16. The distilled datasets are learned and synthesized, but in theory they approximate the larger dataset. Note that the synthesized data may not follow the same data distribution.

Some dataset distillation techniques refer to their approach as compression as well. I disagree with this in principle: compression, albeit lossy in this context, refers to compressing the dataset, whereas with distillation the data representation is deduced/synthesized, potentially leading to entirely different samples altogether. Perhaps it's the compressibility factor a.k.a. compression ratio that applies to both techniques. For example, figure 13 shows the extent to which distilled images can change.

The dataset distillation 17 paper states:

We present a new optimization algorithm for synthesizing a small number of synthetic data samples not only capturing much of the original training data but also tailored explicitly for fast model training in only a few gradient steps 17.

Their problem formulation was very interesting! They derive the network weights as a differentiable function of the synthetic training data and set the training objective to optimize the pixel values of the distilled images. The results from this study showed that one can go as low as one synthetic image per category without regressing too much on performance. More concretely, distilling the 60K training images of the MNIST digit dataset into only 10 synthetic images (one per class) yields a test-time MNIST recognition performance of 94%, compared to 99% for the original dataset.

Figure 8: Dataset distillation results from FAIR study 17.
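For intuition only, here is a toy TensorFlow sketch of that formulation: the synthetic pixels are trainable variables, a freshly initialized network takes one differentiable gradient step on them, and the loss on real data is backpropagated into the pixels. It is a heavy simplification of 17 (a single inner step, a tiny MLP, made-up shapes and names), not the paper's actual algorithm or code.

import tensorflow as tf

NUM_CLASSES, IMG_DIM, HIDDEN = 10, 8 * 8, 32
x_syn = tf.Variable(tf.random.normal([NUM_CLASSES, IMG_DIM]))   # learnable synthetic pixels
y_syn = tf.one_hot(tf.range(NUM_CLASSES), NUM_CLASSES)          # fixed labels, one per class
inner_lr, outer_opt = 0.02, tf.keras.optimizers.Adam(1e-2)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def init_params():
    return [tf.Variable(tf.random.normal([IMG_DIM, HIDDEN]) * 0.1), tf.Variable(tf.zeros([HIDDEN])),
            tf.Variable(tf.random.normal([HIDDEN, NUM_CLASSES]) * 0.1), tf.Variable(tf.zeros([NUM_CLASSES]))]

def mlp(params, x):
    w1, b1, w2, b2 = params
    return tf.matmul(tf.nn.relu(tf.matmul(x, w1) + b1), w2) + b2

def distill_step(x_real, y_real):
    params = init_params()                                            # fresh random init each outer step
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            inner_loss = loss_fn(y_syn, mlp(params, x_syn))
        grads = inner_tape.gradient(inner_loss, params)
        new_params = [p - inner_lr * g for p, g in zip(params, grads)]   # differentiable inner SGD step
        outer_loss = loss_fn(y_real, mlp(new_params, x_real))            # outer objective on real data
    outer_opt.apply_gradients([(outer_tape.gradient(outer_loss, x_syn), x_syn)])
    return outer_loss

# Toy "real" batch standing in for the large dataset.
x_real = tf.random.normal([64, IMG_DIM])
y_real = tf.one_hot(tf.random.uniform([64], maxval=NUM_CLASSES, dtype=tf.int32), NUM_CLASSES)
for _ in range(5):
    distill_step(x_real, y_real)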

Here are some of the distilled samples for the classes labeled at the top (fig 9). It's amazing how well a model trained on these MNIST sets does, but the CIFAR one misses the mark by a wide margin, reaching only about 54% compared to 80% on the full dataset (fig 8 & 9).

Figure 9: Dataset distillation results from FAIR study 17.

Following this work, another distillation technique was proposed utilizing kernel methods, more specifically kernel ridge regression, to obtain an ε-approximation of the original dataset 18. This technique is termed Kernel Inducing Points (KIP) and follows the same principle of keeping the objective function to maximize the approximation and backpropagating the gradients to learn the synthesized distilled data. The difference between 18 and 17 is that 17 uses the same DL network while 18 uses kernel techniques. With KIP, another advantage is that not just source samples but optionally labels can be synthesized too. In 17, the objective was purely to learn pixel values and thus the source (X). Paper 18 also proposes another technique, Label Solve (LS), in which X is kept constant and only the label (Y) is learned.

Figure 10: Examples of distilled samples a) with KIP and b) With LS 18.

The CIFAR-10 result from 17 (fig 9) was about 36.79% for 10 samples; with KIP there is a slight gain in performance, given the extreme compression. This raises the question of what a good compression ratio is that can still guarantee good information retention. For complex tasks like CIFAR (compared to MNIST), 10 samples (one per class) may not be enough given how much more complex this dataset is.

Figure 11: CIFAR10 result from KIP and LS 18.

Actually, the LO Shot technique 11, discussed previously, is also a specialized form of X-shot technique that does dataset distillation. Aside from that, gradient-based techniques for dataset distillation have been actively investigated in the last few years (ref 16,17,18,19,20). Another approach explored a siamese augmentation method termed Differentiable Siamese Augmentation (DSA) that uses a matching loss and synthesizes the dataset through backpropagation 16.

Figure 12: Differentiable Siamese Augmentation 16.

Bayesian and gradient-descent-trained neural networks converge to Gaussian processes (GP) as the number of hidden units in intermediary layers approaches infinity 20 (fig 13). This is true for convolutional networks as well: they converge to a particular Gaussian process as the channel count is stretched to infinity. These networks can thus be described by kernels known as the Neural Tangent Kernel (NTK). The Neural Tangents library, based on the auto-differentiation toolkit JAX, has been used to apply these kernels in defining some of the recent distillation methods; references 18,20,21 are such examples.

Figure 13: Infinite-width convolutional networks converging to Gaussian processes 20.
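As a taste of what that looks like in code, the sketch below uses the Neural Tangents stax API (assuming its standard interface) to build a small fully connected network and evaluate its NTK on toy data; the architecture and shapes are arbitrary.

import numpy as np
from neural_tangents import stax

# init_fn/apply_fn describe the finite network; kernel_fn is its infinite-width kernel.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(10),
)

x_train = np.random.randn(20, 32).astype(np.float32)    # toy inputs
x_test = np.random.randn(5, 32).astype(np.float32)

k_train_train = kernel_fn(x_train, x_train, 'ntk')      # NTK Gram matrix on training points
k_test_train = kernel_fn(x_test, x_train, 'ntk')        # cross-kernel used for prediction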

The authors of KIP and LS techniques 18 explore how to scale and accelerate the distillation process to apply these techniques to infinitely wide convolutional networks 20. Their results are very promising (see fig 14).

Figure 14: KIP ConvNet results 20.

A visual inspection of the distilled dataset from the infinite-width CNN-based KIP technique is shown in fig 15. The distillation results are curious, to say the least. In the example, the distilled apple image seems to represent a pile of apples, whereas the distilled bottle results in visibly different bottles while still showing artifacts of the original two bottles, and the other classes show higher-order features (with some noise).

Figure 15: KIP ConvNet example of distilled CIFAR set 20.

Figure 16 shows the MNIST results; they are not only very interesting but also look very much like mixup (where both x and y are mixed).

Figure 16: KIP ConvNet example of distilled MNIST set 20.

3. So what if you have noisy data

Noise in a dataset is usually considered a nuisance. Because models hold a compressed form of the knowledge represented by the dataset, dataset curation techniques carefully look to avoid noise in datasets.

Yet noise can be an incredibly powerful tool to fill in missing information in source images. For instance, if only part of an image is known, then instead of padding the image with a default (0 or 1) pixel value, filling in the unknown region with random noise can be an effective technique, avoiding confusion with real values that relate to the black or white region. This has held true in my own experience. It was very amusing to note that the forgetting event study 14 in fact looked into the effect of added label noise on the distribution of forgetting events. They also added noise to pixel values and observed that an increasing amount of noise decreases the number of unforgettable samples (see also figure 7 for their results when noise was used).
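A minimal numpy sketch of that padding trick (my own illustration; shapes and the noise distribution are arbitrary):

import numpy as np

def pad_with_noise(image, target_h, target_w, rng=np.random.default_rng(0)):
    # Place the known patch in the top-left corner and fill the unknown region with
    # random noise instead of a constant 0 or 1 value.
    h, w, c = image.shape
    canvas = rng.uniform(0.0, 1.0, size=(target_h, target_w, c))
    canvas[:h, :w, :] = image
    return canvas

patch = np.random.rand(100, 150, 3)          # only part of the image is known
full = pad_with_noise(patch, 224, 224)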

Noise arising from randomness is also handled very well by DL networks. I find the result from 22, shown in figure 17, quite fascinating. Looking at how well the model does when the missing patches are random, and how poorly it does when the missing patches are systematic, is indicative of how powerful and how lame (both at the same time) machines are!

Figure 17: Noisy patches in Masked-AutoEncoder 22.

The GraNd study 21 looked into the effect of noise on the source itself and performed a series of experiments, concluding that when there is enough data, keeping the high-score examples, which are often noisy or difficult, does not hurt performance and can only help.

Conclusion

In summary, the last four years have been incredibly exciting for data in the DL space, and the year 2021 even more so! There is a lot of mileage we can get out of simpler techniques like mixup, but the more exciting developments are dissecting training dynamics and exploring the importance of individual samples in solving a particular task with DL techniques. Distillation methods are still at an early stage where they work well for simpler datasets, but honestly, how many real-world problems have simple datasets? Nevertheless, there is some really exciting development in this space. These techniques can be groundbreaking if the compression methods hold across a wide range of architectures, as indicated by 21.

References

  1. [1707.02968] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. Accessed January 3, 2022. https://arxiv.org/abs/1707.02968.
  2. Hestness, Joel, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. “Deep Learning Scaling Is Predictable, Empirically,” December 2017. https://arxiv.org/abs/1712.00409.
  3. https://www.wired.com/story/no-data-is-not-the-new-oil/
  4. https://pages.run.ai/hubfs/PDFs/2021-State-of-AI-Infrastructure-Survey.pdf
  5. [1808.01974] A Survey on Deep Transfer Learning. Accessed January 5, 2022. https://arxiv.org/abs/1808.01974.
  6. Transfer Learning. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.1515&rep=rep1&type=pdf
  7. Zhang, Hongyi, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. “Mixup: Beyond Empirical Risk Minimization,” October 2017. https://arxiv.org/abs/1710.09412.
  8. [1812.01187] Bag of Tricks for Image Classification with Convolutional Neural Networks. Accessed December 30, 2021. https://arxiv.org/abs/1812.01187.
  9. [2009.08449] ’Less Than One’-Shot Learning: Learning N Classes From M < N Samples. Accessed January 5, 2022. https://arxiv.org/abs/2009.08449.
  10. [1512.00567] Rethinking the Inception Architecture for Computer Vision. Accessed January 5, 2022. https://arxiv.org/abs/1512.00567.
  11. [1904.05046] Generalizing from a Few Examples: A Survey on Few-Shot Learning. Accessed January 5, 2022. https://arxiv.org/abs/1904.05046.
  12. Li Fei-Fei, R. Fergus and P. Perona, “One-shot learning of object categories,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, April 2006, doi: 10.1109/TPAMI.2006.79. https://ieeexplore.ieee.org/document/1597116
  13. [1606.04080] Matching Networks for One Shot Learning. Accessed January 5, 2022. https://arxiv.org/abs/1606.04080.
  14. [1812.05159] An Empirical Study of Example Forgetting during Deep Neural Network Learning. Accessed December 29, 2021. https://arxiv.org/abs/1812.05159.
  15. [1906.11829] Selection via Proxy: Efficient Data Selection for Deep Learning. Accessed December 29, 2021. https://arxiv.org/abs/1906.11829.
  16. [2102.08259] Dataset Condensation with Differentiable Siamese Augmentation. Accessed January 5, 2022. https://arxiv.org/abs/2102.08259.
  17. [1811.10959] Dataset Distillation. Accessed January 5, 2022. https://arxiv.org/abs/1811.10959.
  18. [2011.00050] Dataset Meta-Learning from Kernel Ridge-Regression. Accessed January 5, 2022. https://arxiv.org/abs/2011.00050.
  19. [2006.05929] Dataset Condensation with Gradient Matching. Accessed January 5, 2022. https://arxiv.org/abs/2006.05929.
  20. [2107.13034] Dataset Distillation with Infinitely Wide Convolutional Networks. Accessed January 5, 2022. https://arxiv.org/abs/2107.13034.
  21. [2107.07075] Deep Learning on a Data Diet: Finding Important Examples Early in Training. Accessed December 10, 2021. https://arxiv.org/abs/2107.07075.
  22. [2111.06377] Masked Autoencoders Are Scalable Vision Learners. Accessed January 5, 2022. https://arxiv.org/abs/2111.06377.
  23. [1802.03426] UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Accessed January 5, 2022. https://arxiv.org/abs/1802.03426.
  24. Maaten, Laurens van der and Geoffrey E. Hinton. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9 (2008): 2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
  25. [2107.02331] Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. Accessed January 8, 2022. https://arxiv.org/abs/2107.02331.

End-to-end reproducible Machine Learning pipelines on Kubernetes

This is Part 3 - End-to-end reproducible Machine Learning pipelines on Kubernetes of technical blog series titled [Reproducibility in Machine Learning]. Part 1 & Part 2 can be found here & here respectively.


The Change Anything Changes Everything (CAKE) principle - [Scully et al][scully_2015] - is real in ML. Also, 100% reproducible ML code comes at the cost of speed, a non-negotiable aspect in today's time. If we cannot be 100% reproducible and change is inevitable, then the only way to maintain explainability, understanding, trust & confidence is to version control everything.

Figure 1: Version control explained by XKCD

In this post, we will be looking at building an end to end fully automated ML pipeline that can maintain full provenance across the entire ML system. In my view, a standard machine learning workflow looks like the one below:

Figure 2: Machine Learning end to end system

So we will be working towards building this system with full provenance over it. For this, we will be extending our sample semantic segmentation example based on the Oxford Pet dataset.

To build this ML workflow, we will be using Kubernetes - a container orchestration platform. On top of Kubernetes, we will be running Pachyderm, which will do the heavy lifting of maintaining provenance across data, artifacts, and ML processes.

What to version control

In part 1 of this blog series, we discussed the challenges, shown in figure 3, in realizing reproducible ML.

Figure 3: Overview of challenges in reproducible ML

The presence of these challenges in the system-wide view of ML is shown in figure 4.

Figure 4: What to version control?

But first, let's talk about creating the environment, infrastructure, and versioning it.

1. Versioning environment

With GitOps, the environment and any changes associated with it can be version controlled. In this sample, we will be using ArgoCD to implement the GitOps workflow (figure 5), which will see our environment config move to sit alongside the code repository (figure 6).

Figure 5: Gitops

This is achieved by defining Argo apps which can be applied on a BYO Kubernetes cluster (version 1.14.7) that has ArgoCD installed:

kubectl apply -f https://raw.githubusercontent.com/suneeta-mall/e2e-ml-on-k8s/master/cluster-conf/e2e-ml-argocd-app.yaml

Figure 6: Gitops on environment config

Once the Argo apps are created, the following software will be installed on the cluster:

  • Kubernetes: 1.14.7 (tested on this version; in theory, it should work with other versions too!)
  • ArgoCD: 1.2.3
  • Kubeflow: 0.6.2 - Kubeflow is an ML toolkit designed to bring a variety of ML-related Kubernetes-based software together.
  • Seldon: 0.4.1 (upgraded from the version packaged with Kubeflow 0.6.2) - Seldon is model-serving software.
  • Pachyderm: 1.9.7 - Pachyderm offers a git-like repository that can hold data, even big data. It also offers automated repository capabilities that act on input and generate data, thus holding a versioned copy of this generated data. Together, these constructs can be used to create DAG-like pipeline processes with provenance across graph input, transformation spec, and output.

Any change to this configuration repository will then trigger a cluster update, keeping the environment in sync with the versioned config.

2. Versioning data, process, and artifacts

Figure 7: Artifact view of Machine Learning end to end system (shown in figure 2)

The Pachyderm pipeline specification for the end-to-end ML workflow shown in figure 2 is available here. The artifacts/data generated by this ML workflow are shown in figure 7 above, along with their association with the various processes.

---
pipeline:
  name: stream
transform:
  image: suneetamall/pykubectl:1.14.7
  cmd:
  - "/bin/bash"
  stdin:
  - "wget -O images.tar.gz https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz && \
     wget -O annotations.tar.gz https://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz && \
     tar -cvf data.tar.gz *.tar.gz && \
     cat data.tar.gz > /pfs/out && \
     while :; do sleep 2073600; done"
spout:
  overwrite: true
---
input:
  pfs:
    glob: /
    repo: stream
pipeline:
  name: warehouse
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python download_petset.py --input /pfs/stream/ --output /pfs/out"
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: warehouse
pipeline:
  name: transform
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python dataset_gen.py --input /pfs/warehouse --output /pfs/out"
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: transform
pipeline:
  name: train
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python train.py --input /pfs/transform --output /pfs/out --checkpoint_path /pfs/out/ckpts --tensorboard_path /pfs/out"
resource_requests:
  memory: 2G
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: transform
pipeline:
  name: tune
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python tune.py --input /pfs/transform --output /pfs/out"
resource_requests:
  memory: 4G
  cpu: 1
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/"
       repo: transform
    - pfs:
        glob: "/optimal_hp.json"
        repo: tune
pipeline:
  name: model
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python train.py --input /pfs/transform --hyperparam_fn_path /pfs/tune/optimal_hp.json
     --output /pfs/out --checkpoint_path /pfs/out/ckpts --tensorboard_path /pfs/out"
  - "ln -s /pfs/tune/optimal_hp.json /pfs/out/optimal_hp.json"
resource_requests:
  memory: 2G
#  gpu:
#    type: nvidia.com/gpu
#    number: 1
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/calibration"
       repo: transform
    - pfs:
        glob: "/model.h5"
        repo: model
pipeline:
  name: calibrate
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python calibrate.py --input /pfs/transform --model_weight /pfs/model/model.h5 --output /pfs/out"
  - "ln -s /pfs/model/model.h5 /pfs/out/model.h5"
datum_tries: 2
#standby: true
---
input:
  cross:
    - pfs:
       glob: "/test"
       repo: transform
    - pfs:
        glob: "/"
        repo: calibrate
pipeline:
  name: evaluate
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "papermill evaluate.ipynb /pfs/out/Report.ipynb \
      -p model_weights /pfs/calibrate/model.h5 \
      -p calibration_weights /pfs/calibrate/calibration.weights \
      -p input_data_dir /pfs/transform \
      -p out_dir /pfs/out \
      -p hyperparameters /pfs/calibrate/optimal_hp.json"
  - "ln -s /pfs/calibrate/model.h5 /pfs/out/model.h5"
  - "ln -s /pfs/calibrate/calibration.weights /pfs/out/calibration.weights"
resource_requests:
  memory: 1G
datum_tries: 2
#standby: true
---
input:
  pfs:
    glob: "/"
    repo: evaluate
pipeline:
  name: release
transform:
  cmd:
  - "/bin/bash"
  image: suneetamall/e2e-ml-on-k8s:1
  stdin:
  - "python release.py --model_db evaluate --input /pfs/evaluate/evaluation_result.csv --version ${evaluate_COMMIT}"
pod_spec: '{"serviceAccount": "ml-user", "serviceAccountName": "ml-user"}'
datum_tries: 2
#standby: true
---
## Service https://docs.pachyderm.io/en/1/concepts/pipeline-concepts/pipeline/service.html
input:
  pfs:
    glob: "/"
    repo: model
pipeline:
  name: tensorboard
service:
  external_port: 30888
  internal_port: 6006
transform:
  cmd:
  - "/bin/bash"
  stdin:
  - tensorboard --logdir=/pfs/model/
  image: suneetamall/e2e-ml-on-k8s:1
---

This pipeline creates the ML workflow, with the artifact dependencies shown in figure 7 above, wherein full provenance across data, processes, and outcomes is maintained along with the respective lineage.


This is the last post of the technical blog series, [Reproducibility in Machine Learning].

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine Learning.html

[scully_2015]: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-Machine Learning-systems.pdf

Replicability - an extension to reproducibility

This is Part 3 - End-to-end reproducible Machine Learning pipelines on Kubernetes of technical blog series titled [Reproducibility in Machine Learning]. Part 1 & Part 2 can be found here & here respectively.

Part 1: Reproducibility in Machine Learning - Research and Industry

Replicability and

The research community is quite divided when it comes to defining reproducibility and often mixes it up with replicability.

Figure 4: Replicability defined

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine Learning.html

Realizing reproducible Machine Learning - with Tensorflow

This is Part 2 - Realizing reproducible Machine Learning - with Tensorflow of technical blog series titled [Reproducibility in Machine Learning]. Part 1 & Part 3 can be found here & here respectively.


As discussed in Part 1, writing reproducible machine learning code is not easy, with challenges arising from every direction, e.g. hardware, software, algorithms, process & practice, and data. In this post, we will focus on what is needed to ensure ML code is reproducible.

First things first

There are a few very simple things that need to be done before thinking big and focusing on writing reproducible ML code. These are:

  • Code is version controlled

Same input (data) and same process (code) resulting in the same output is the essence of reproducibility. But code keeps evolving, since ML is so iterative. Hence, it's important to version control code (including configuration). This allows obtaining the exact code (commit/version) from the source repository.

  • Reproducible runtime – pinned libraries

So we version-controlled the code, but what about the environment/runtime? Sometimes non-determinism is introduced not directly by user code but by its dependencies. We talked about this at length in the software section of Challenges in realizing reproducible ML in Part 1. Take the example of Pyproj, a geospatial transform library that I once used to compute geo-locations based on some parameters. We changed nothing but the version of pyproj, from V1.9.6 to V2.4.0, and suddenly all our calculations were giving different results. The difference was so large that the location calculated for the San Diego Convention Centre came out to be somewhere in Miramar, off a golf course (see figure 1) issue link. Now imagine ordering pizza delivery on the back of my computation snippet backed by an unpinned version of pyproj!

Figure 1: Example of why pinned libraries are important

Challenges like these occur more often than we would like. That's why it's important to pin/fix the dependent runtime, either by pinning library versions or by using versioned containers (like Docker), as illustrated below.
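For instance, a minimal requirements.txt that pins exact versions keeps the runtime fixed across machines and over time (the versions below are illustrative); a versioned container image achieves the same thing one level up.

# requirements.txt - pin exact versions, not ranges
pyproj==1.9.6
tensorflow==2.0.0
numpy==1.17.4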

  • Smart randomness

As discussed in Part 1, randomization plays a key role in most ML algorithms. Unseeded randomness is the simplest way to make code non-reproducible. It is also one of the easiest to manage amongst all the gotchas of non-reproducible ML. All we need to do is seed all the randomness and manage the seed via configuration (as code or externally).

  • Rounding precision, under-flows & overflow

Floating-point arithmetic is ubiquitous in ML. The complexity and intensity of floating-point operations (FLOPS) are increasing every day, with current needs easily reaching the giga-FLOP order of computation. To achieve speed despite the complexity, mixed-precision floating-point operations have also been proposed. As discussed in Part 1, accelerated hardware such as GPGPUs and tensor processing units (TPUs), due to their architecture and asynchronous computing, do not guarantee reproducibility. Besides, when dealing with floating points, issues related to overflow and underflow are expected. This just adds to the complexity.

  • Dependent library’s behavior aware

As discussed in the software section of Challenges in realizing reproducible ML in Part 1, some routines of ML libraries do not guarantee reproducibility, for instance NVIDIA's CUDA-based deep learning library cuDNN. Similarly, with Tensorflow, using some methods may result in non-deterministic behavior; one such example is the backward pass of broadcasting on GPU ref. Awareness of the behaviors of the libraries being used, and approaches to overcome the non-determinism, should be explored.

Writing reproducible ML

To demonstrate reproducible ML, I will be using the Oxford Pet dataset, which has labels for pet images. I will be doing semantic segmentation of pet images using pixel-level labels. Each pixel of a pet image in the Oxford Pet dataset is labeled as 1) Foreground (pet area), 2) Background (not pet area), or 3) Unknown (edges). These labels are by definition mutually exclusive, i.e. a pixel can only be one of the above 3.

Figure 2: Oxford pet dataset

I will be using a convolutional neural network (ConvNet) for semantic segmentation. The network architecture is based on U-net. This is similar to the standard semantic segmentation example by TensorFlow.

Figure 3: U-net architecture

The reproducible version of the semantic segmentation example is available in the Github repository. This example demonstrates reproducible ML and also performs end-to-end ML with provenance across process and data.

Figure 4: Reproducible ML sample - semantic segmentation of oxford pet

In this post, however, I will be discussing only the reproducible ML aspect of it and will be referencing snippets of this example.

ML workflow

In reality, a machine learning workflow is very complex and looks somewhat like figure 5. In this post, however, we will only discuss the data and model training parts of it. The remaining, end-to-end, workflow will be discussed in the next post.

Figure 5: Machine learning workflow

Data

The source dataset is the Oxford Pet dataset, which contains a multitude of labels, e.g. class outcome, pixel-wise labels, bounding boxes, etc. The first step is to process this data to generate the trainable dataset. In the example code, this is done by the download_petset.py script.

python download_petset.py  --output /wks/petset
The resulting sample is shown in figure 6.

Figure 6: Pets data partitioned by Pets ID

After data partitioning, the entire dataset is divided into 4 sets: a) training, b) validation, c) calibration, and d) test. We would want this set partitioning strategy to be reproducible. By doing this, we ensure that if we have to blow away the training dataset, or if accidental data loss occurs, then the exact same dataset can be recreated.

In this sample, this is achieved by generating a hash of the pet id and partitioning the hash into 10 folds (snippet below) to obtain the partition index of the pet id.

import hashlib, os
partition_idx = int(hashlib.md5((os.path.basename(petset_id)).encode()).hexdigest(), 16) % 10

With partition_idx 0-6 assigned to training, 7 to validation, 8 to calibration, and 9 to test, every generation will result in each pet going into the same respective partition.
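A minimal sketch of that fold-to-set mapping (assuming partition_idx is computed as above; names are illustrative):

# Deterministic fold-to-split mapping: 0-6 training, 7 validation, 8 calibration, 9 test.
SPLITS = {7: "validation", 8: "calibration", 9: "test"}

def split_for(partition_idx):
    return SPLITS.get(partition_idx, "training")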

Besides set partitioning, any random data augmentation performed is seeded, with the seed controlled as configuration as code. See tf.image.random_flip_left_right used in this tensorflow data pipeline method.

The script for model dataset preparation is located in dataset_gen.py and is used as follows:

python dataset_gen.py --input /wks/petset --output /wks/model_dataset
with results shown below:

Figure 7: Pets data partitioned into training, validation, calibration, and test set

Modelling semantic segmentation

The model for pet segmentation is based on U-net with a backbone of either MobileNet-v2 or VGG-19 (defaulting to VGG-19). As per this model's network architecture, 5 activation layers of the pre-trained backbone network are chosen. These layers are:

  • MobileNet
    'block_1_expand_relu'
    'block_3_expand_relu'
    'block_6_expand_relu'
    'block_13_expand_relu'
    'block_16_project'
  • VGG
    'block1_pool'
    'block2_pool'
    'block3_pool'
    'block4_pool'
    'block5_pool'
    

Each of these layers is then concatenated with a corresponding upsampling layer comprising a Conv2DTranspose layer, forming what's known as a skip connection (a rough sketch follows below). See the model code for more info.
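As a rough sketch of what one such skip connection looks like in Keras (layer sizes are illustrative; this is not the exact model code):

import tensorflow as tf

def upsample_block(x, skip, filters):
    # Upsample x with a Conv2DTranspose and concatenate the corresponding backbone activation.
    x = tf.keras.layers.Conv2DTranspose(filters, kernel_size=3, strides=2, padding="same")(x)
    return tf.keras.layers.Concatenate()([x, skip])

# Illustrative shapes: upsample a 14x14 feature map and fuse it with a 28x28 backbone activation.
low_res = tf.keras.Input(shape=(14, 14, 256))
backbone_activation = tf.keras.Input(shape=(28, 28, 192))
fused = upsample_block(low_res, backbone_activation, filters=128)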

The training script train.py can be used as follows:

python train.py --input /wks/model_dataset --output /wks/model --checkpoint_path /wks/model/ckpts --tensorboard_path /wks/model
1. Seeding randomness

All methods exploiting randomness are used with an appropriate seed:

  • All random initialization is seeded, e.g. tf.random_normal_initializer
  • All dropout layers are seeded, e.g. dropout

A global seed is set for any hidden methods that may be using randomness by calling set_seeds(seed), which sets the seed for the libraries used:

import os, random
import numpy as np
import tensorflow as tf

SEED = 42  # illustrative; any fixed value will do
def set_seeds(seed=SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
2. Handling library behaviors
2.1 CuDNN

cuDNN does not guarantee reproducibility in some of its routines. The environment variables TF_DETERMINISTIC_OPS & TF_CUDNN_DETERMINISTIC can be used to control this behavior, as per this snippet (figure 8) from the cuDNN release page.

Figure 8: NVIDIA release page snippet reproducibility

2.2 CPU thread parallelism

As discussed in Part 1, using high parallelism with compute-intensive workflows may not be reproducible. The configurations for inter-op ref and intra-op ref operation parallelism should be set if 100% reproducibility is desired.

In this example, I have chosen 1 to avoid any non-determinism arising from inter-operation parallelism. Warning: setting this will considerably slow down the training.

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
2.3 Tensorflow determinism

Following are some of the applications of Tensorflow that are not reproducible:

  • Backward pass of broadcasting on GPU is non-deterministic link
  • GPU reductions are mentioned as nondeterministic in the docs link
  • Problems getting TensorFlow to behave deterministically link

Duncan Riach, along with several other contributors, has created the tensorflow_determinism package that can be used to overcome non-reproducibility-related challenges in TensorFlow. It should be used in addition to the measures we have discussed so far.

If we combine all the approaches discussed above (aside from seeded randomness), they can be wrapped into a lightweight method like the one below:

def set_global_determinism(seed=SEED, fast_n_close=False):
    """
        Enable 100% reproducibility on operations related to tensor and randomness.
        Parameters:
        seed (int): seed value for global randomness
        fast_n_close (bool): whether to favour speed at the cost of determinism/reproducibility
    """
    set_seeds(seed=seed)
    if fast_n_close:
        return

    logging.warning("*******************************************************************************")
    logging.warning("*** set_global_determinism is called,setting full determinism, will be slow ***")
    logging.warning("*******************************************************************************")

    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    # https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
    from tfdeterminism import patch
    patch()
which can then be used on top of the ML algorithm/process code to make it 100% reproducible.

Result

What happens if we don't write reproducible ML? What kind of difference are we really talking about? The last two columns of figure 9 show results obtained by models trained on the exact same dataset and exactly the same code, with EXACTLY one exception: the dropout layers used in the network were unseeded. All other measures discussed above were taken into account.

Figure 9: Effect of just forgetting to set one seed amidst many

Looking at the result for the first pet, which is a very simple case, we can see a subtle difference in the outcome of the two models. The second pet case is slightly complicated due to shadow, and we can see obvious differences in the outcome. But what about the third case? This is a very hard case for the pre-trained frozen-backbone model we are using, and we can see major differences in the results of the two models.

If we use all the measures discussed above, then 100% reproducible ML can be obtained. This is shown in the following 2 logs obtained by running:

python train.py \
  --input /wks/model_dataset \
  --hyperparam_fn_path best_hyper_parameters.json \
  --output logs \
  --checkpoint_path "logs/ckpts" \
  --tensorboard_path logs
wherein the content of best_hyper_parameters.json is:
{
   "batch_size":60,
   "epochs":12,
   "iterations":100,
   "learning_rate":0.0018464290366223407,
   "model_arch":"MobileNetV2",
   "steps_per_epoch":84
}
Run attempt 1:
WARNING:root:******* set_global_determinism is called, setting seeds and determinism *******
TensorFlow version 2.0.0 has been patched using tfdeterminism version 0.3.0
Input: tf-data, Model: MobileNetV2, Batch Size: 60, Epochs: 12, Learning Rate: 0.0018464290366223407, Steps Per Epoch: 84
Train for 84 steps, validate for 14 steps
Epoch 1/12
2019-11-07 12:48:27.576286: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
84/84 [==============================] - 505s 6s/step - loss: 0.8187 - iou_score: 0.4867 - f1_score: 0.6092 - binary_accuracy: 0.8749 - val_loss: 1.0283 - val_iou_score: 0.4910 - val_f1_score: 0.6297 - val_binary_accuracy: 0.8393
Epoch 2/12
84/84 [==============================] - 533s 6s/step - loss: 0.6116 - iou_score: 0.6118 - f1_score: 0.7309 - binary_accuracy: 0.9150 - val_loss: 0.6965 - val_iou_score: 0.5817 - val_f1_score: 0.7079 - val_binary_accuracy: 0.8940
Epoch 3/12
84/84 [==============================] - 527s 6s/step - loss: 0.5829 - iou_score: 0.6301 - f1_score: 0.7466 - binary_accuracy: 0.9197 - val_loss: 0.6354 - val_iou_score: 0.6107 - val_f1_score: 0.7312 - val_binary_accuracy: 0.9038
Epoch 4/12
84/84 [==============================] - 503s 6s/step - loss: 0.5733 - iou_score: 0.6376 - f1_score: 0.7528 - binary_accuracy: 0.9213 - val_loss: 0.6192 - val_iou_score: 0.6227 - val_f1_score: 0.7411 - val_binary_accuracy: 0.9066
Epoch 5/12
84/84 [==============================] - 484s 6s/step - loss: 0.5566 - iou_score: 0.6461 - f1_score: 0.7599 - binary_accuracy: 0.9241 - val_loss: 0.5827 - val_iou_score: 0.6381 - val_f1_score: 0.7534 - val_binary_accuracy: 0.9156
Epoch 6/12
84/84 [==============================] - 509s 6s/step - loss: 0.5524 - iou_score: 0.6497 - f1_score: 0.7629 - binary_accuracy: 0.9247 - val_loss: 0.5732 - val_iou_score: 0.6477 - val_f1_score: 0.7605 - val_binary_accuracy: 0.9191
Epoch 7/12
84/84 [==============================] - 526s 6s/step - loss: 0.5439 - iou_score: 0.6544 - f1_score: 0.7669 - binary_accuracy: 0.9262 - val_loss: 0.5774 - val_iou_score: 0.6456 - val_f1_score: 0.7590 - val_binary_accuracy: 0.9170
Epoch 8/12
84/84 [==============================] - 523s 6s/step - loss: 0.5339 - iou_score: 0.6597 - f1_score: 0.7710 - binary_accuracy: 0.9279 - val_loss: 0.5533 - val_iou_score: 0.6554 - val_f1_score: 0.7672 - val_binary_accuracy: 0.9216
Epoch 9/12
84/84 [==============================] - 518s 6s/step - loss: 0.5287 - iou_score: 0.6620 - f1_score: 0.7730 - binary_accuracy: 0.9288 - val_loss: 0.5919 - val_iou_score: 0.6444 - val_f1_score: 0.7584 - val_binary_accuracy: 0.9148
Epoch 10/12
84/84 [==============================] - 506s 6s/step - loss: 0.5259 - iou_score: 0.6649 - f1_score: 0.7753 - binary_accuracy: 0.9292 - val_loss: 0.5532 - val_iou_score: 0.6554 - val_f1_score: 0.7674 - val_binary_accuracy: 0.9218
Epoch 11/12
84/84 [==============================] - 521s 6s/step - loss: 0.5146 - iou_score: 0.6695 - f1_score: 0.7789 - binary_accuracy: 0.9313 - val_loss: 0.5586 - val_iou_score: 0.6581 - val_f1_score: 0.7689 - val_binary_accuracy: 0.9221
Epoch 12/12
84/84 [==============================] - 507s 6s/step - loss: 0.5114 - iou_score: 0.6730 - f1_score: 0.7818 - binary_accuracy: 0.9317 - val_loss: 0.5732 - val_iou_score: 0.6501 - val_f1_score: 0.7626 - val_binary_accuracy: 0.9179

Run attempt 2:

WARNING:root:******* set_global_determinism is called, setting seeds and determinism *******
TensorFlow version 2.0.0 has been patched using tfdeterminism version 0.3.0
Input: tf-data, Model: MobileNetV2, Batch Size: 60, Epochs: 12, Learning Rate: 0.0018464290366223407, Steps Per Epoch: 84
Train for 84 steps, validate for 14 steps

Epoch 1/12
2019-11-07 10:45:51.549715: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
84/84 [==============================] - 549s 7s/step - loss: 0.8187 - iou_score: 0.4867 - f1_score: 0.6092 - binary_accuracy: 0.8749 - val_loss: 1.0283 - val_iou_score: 0.4910 - val_f1_score: 0.6297 - val_binary_accuracy: 0.8393
Epoch 2/12
84/84 [==============================] - 515s 6s/step - loss: 0.6116 - iou_score: 0.6118 - f1_score: 0.7309 - binary_accuracy: 0.9150 - val_loss: 0.6965 - val_iou_score: 0.5817 - val_f1_score: 0.7079 - val_binary_accuracy: 0.8940
Epoch 3/12
84/84 [==============================] - 492s 6s/step - loss: 0.5829 - iou_score: 0.6301 - f1_score: 0.7466 - binary_accuracy: 0.9197 - val_loss: 0.6354 - val_iou_score: 0.6107 - val_f1_score: 0.7312 - val_binary_accuracy: 0.9038
Epoch 4/12
84/84 [==============================] - 515s 6s/step - loss: 0.5733 - iou_score: 0.6376 - f1_score: 0.7528 - binary_accuracy: 0.9213 - val_loss: 0.6192 - val_iou_score: 0.6227 - val_f1_score: 0.7411 - val_binary_accuracy: 0.9066
Epoch 5/12
84/84 [==============================] - 534s 6s/step - loss: 0.5566 - iou_score: 0.6461 - f1_score: 0.7599 - binary_accuracy: 0.9241 - val_loss: 0.5827 - val_iou_score: 0.6381 - val_f1_score: 0.7534 - val_binary_accuracy: 0.9156
Epoch 6/12
84/84 [==============================] - 494s 6s/step - loss: 0.5524 - iou_score: 0.6497 - f1_score: 0.7629 - binary_accuracy: 0.9247 - val_loss: 0.5732 - val_iou_score: 0.6477 - val_f1_score: 0.7605 - val_binary_accuracy: 0.9191
Epoch 7/12
84/84 [==============================] - 506s 6s/step - loss: 0.5439 - iou_score: 0.6544 - f1_score: 0.7669 - binary_accuracy: 0.9262 - val_loss: 0.5774 - val_iou_score: 0.6456 - val_f1_score: 0.7590 - val_binary_accuracy: 0.9170
Epoch 8/12
84/84 [==============================] - 514s 6s/step - loss: 0.5339 - iou_score: 0.6597 - f1_score: 0.7710 - binary_accuracy: 0.9279 - val_loss: 0.5533 - val_iou_score: 0.6554 - val_f1_score: 0.7672 - val_binary_accuracy: 0.9216
Epoch 9/12
84/84 [==============================] - 518s 6s/step - loss: 0.5287 - iou_score: 0.6620 - f1_score: 0.7730 - binary_accuracy: 0.9288 - val_loss: 0.5919 - val_iou_score: 0.6444 - val_f1_score: 0.7584 - val_binary_accuracy: 0.9148
Epoch 10/12
84/84 [==============================] - 531s 6s/step - loss: 0.5259 - iou_score: 0.6649 - f1_score: 0.7753 - binary_accuracy: 0.9292 - val_loss: 0.5532 - val_iou_score: 0.6554 - val_f1_score: 0.7674 - val_binary_accuracy: 0.9218
Epoch 11/12
84/84 [==============================] - 495s 6s/step - loss: 0.5146 - iou_score: 0.6695 - f1_score: 0.7789 - binary_accuracy: 0.9313 - val_loss: 0.5586 - val_iou_score: 0.6581 - val_f1_score: 0.7689 - val_binary_accuracy: 0.9221
Epoch 12/12
84/84 [==============================] - 483s 6s/step - loss: 0.5114 - iou_score: 0.6730 - f1_score: 0.7818 - binary_accuracy: 0.9317 - val_loss: 0.5732 - val_iou_score: 0.6501 - val_f1_score: 0.7626 - val_binary_accuracy: 0.9179

So we have 100% reproducible ML code now, but saying training is snail-ish is an understatement. Training time has increased (CPU-based measurements) from 28 minutes to 1 hr 45 minutes, as we do away with inter-thread parallelism and also asynchronous computation optimizations. This is not practical in reality. This is also why reproducibility in ML is more focused on having a road map to reach the same conclusions - Dodge. This is realized by maintaining a system capable of capturing full provenance over everything involved in the ML process, including data, code, processes, and infrastructure/environment. This will be the focus of part 3 of this blog series.


The next part of the technical blog series, [Reproducibility in Machine Learning], is End-to-end reproducible Machine Learning pipelines on Kubernetes.

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine Learning.html

Reproducibility in Machine Learning - Research and Industry

This is Part 1 - Reproducibility in Machine Learning - Research and Industry of technical blog series titled [Reproducibility in Machine Learning]. Part 2 & Part 3 can be found here & here respectively.


Machine learning (ML) is an interesting field aimed at solving problems that cannot be solved by applying deterministic logic. In fact, ML solves problems in logits [0, 1], with probabilities! ML is a highly iterative and fiddly field, with much of its intelligence derived from data upon application of complex mathematics. Sometimes, even a slight change such as changing the order of the input data can change the outcome of an ML process drastically. Actually, xkcd puts it quite aptly:

Figure 1: Machine Learning explained by XKCD

This phenomenon is explained as Change Anything Changes Everything a.k.a. the CAKE principle, coined by Scully et al. in their NIPS 2015 paper titled ["Hidden Technical Debt in Machine Learning Systems"][scully_2015]. The CAKE principle highlights that in ML, no input is ever really independent.

What is reproducibility in ML

Reproducibility, as per the Oxford dictionary, is defined as something that can be produced again in the same way.

Figure 2: Reproducible defined

In ML context, it relates to getting the same output on the same algorithm, (hyper)parameters, and data on every run.

To demonstrate, let's take a simple linear regression example (shown below) on Scikit Diabetes Dataset. Linear regression is all about fitting a line i.e. Y = a + bX over data-points represented as X, with b being the slope and a being the intercept.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()    
diabetes_X = diabetes.data[:, np.newaxis, 9]
xtrain, xtest, ytrain, ytest = train_test_split(diabetes_X, diabetes.target, test_size=0.33)
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)
diabetes_y_pred = regr.predict(xtest)

# The coefficients 
print(f'Coefficients: {regr.coef_[0]}\n'
      f'Mean squared error: {mean_squared_error(ytest, diabetes_y_pred):.2f}\n'
      f'Variance score: {r2_score(ytest, diabetes_y_pred):.2f}')
# Plot outputs 
plt.scatter(xtest, ytest,  color='green')
plt.plot(xtest, diabetes_y_pred, color='red', linewidth=3)
plt.ylabel('Quantitative measure of diabetes progression')
plt.xlabel('One of six blood serum measurements of patients')
plt.show()
A linear regression example on Scikit Diabetes Dataset

The above ML code is NOT reproducible. Every run will give different results: a) the data distribution will vary, and b) the obtained slope and intercept will vary. See Figure 3.

Figure 3: Repeated run of above linear regression code produces different results

In the above example, we are using the same dataset, the same algorithm, and the same hyper-parameters. So why are we getting different results? Here the method train_test_split splits the diabetes dataset into training and test sets, but while doing so, it performs a random shuffle of the dataset. The seed for this random shuffle is not set here. Because of this, every run produces a different training dataset distribution, and due to this, the regression line slope and intercept end up being different. In this simple example, if we were to set a random state for the method train_test_split, e.g. random_state=42, then we would have a reproducible regression example over the diabetes dataset. The reproducible version of the above regression example is as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()    
diabetes_X = diabetes.data[:, np.newaxis, 9]
xtrain, xtest, ytrain, ytest = train_test_split(diabetes_X, diabetes.target, test_size=0.33,
                                                random_state=42)
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)
diabetes_y_pred = regr.predict(xtest)

# The coefficients 
print(f'Coefficients: {regr.coef_[0]}\n'
      f'Mean squared error: {mean_squared_error(ytest, diabetes_y_pred):.2f}\n'
      f'Variance score: {r2_score(ytest, diabetes_y_pred):.2f}')
# Plot outputs 
plt.scatter(xtest, ytest,  color='green')
plt.plot(xtest, diabetes_y_pred, color='red', linewidth=3)
plt.ylabel('Quantitative measure of diabetes progression')
plt.xlabel('One of six blood serum measurements of patients')
plt.show()
A reproducible linear regression example on Scikit Diabetes Dataset

Seeding the random state is not the only challenge in writing reproducible ML. In fact, there are several reasons why reproducibility in ML is so hard to achieve; I will go into those a bit later in the section Challenges in realizing reproducible ML. The first question should be: why does reproducibility matter in ML?

Importance of reproducibility in ML

Non-reproducible single occurrences are of no significance to science. - Popper (The Logic of Scientific Discovery)

The importance of reproducibility has been increasingly recognized since Nature's survey (2016) reported a reproducibility crisis. As per this survey report, 70% of researchers have failed to reproduce another scientist's experiments, and more than 50% have failed to reproduce their own experiments. With more than half of the participating scientists agreeing to its presence, the reproducibility crisis is indeed very real. Dr. Joelle Pineau, an Associate Professor at McGill University and lead of Facebook's Artificial Intelligence Research lab, covered the reproducibility crisis in her talk at the International Conference on Learning Representations (ICLR) 2018 you tube. She is determined to nip this crisis in the bud in AI research src. It's not just her; several AI research groups are coming up with measures to ensure reproducibility, for example:

  • Model Cards at Google
  • Reproducibility Checklist at NeurIPS
  • ICLR Reproducibility Challenge at ICLR
  • Show your work at Allen Institute of Artificial Intelligence

Aside from being of no use if it can't be reproduced, as Popper suggested in the above quote, why does reproducibility matter?

1. Understanding, Explaining, Debugging, and Reverse Engineering

Reproducibility helps with understanding, explaining, and debugging. Reproducibility is also a crucial means to reverse engineering.

Machine learning is inherently difficult to explain, understand, and also debug. Obtaining different output on each subsequent run makes understanding, explaining, and debugging all the more challenging. How do we ever reverse engineer? As it is, understanding and explaining are hard with machine learning, and they are even harder with deep learning. For over a decade, researchers have been trying to understand what these deep networks learn and have not yet fully succeeded in doing so.

From visualizing higher-layer features of deep networks (year 2009), to activation atlases, i.e. what individual neurons in a deep network do (year 2017), to understanding how deep networks decide (year 2018) - these are all ongoing, progressive efforts towards understanding. Meanwhile, explainability has morphed into a dedicated field, 'Explainable Artificial Intelligence' (XAI).

2. Correctness

If anything can go wrong, it will -Murphy's law

Correctness is important, as Murphy's law rarely fails us. These are some examples of the great AI failures of our times.
Figure 4: Example of some of the great AI failures of our times

Google Photos launched AI capabilities for automatically tagging images, and it was found to be tagging people of dark skin as gorillas. Amazon's recruiting software exhibited gender bias, and IBM's Watson gave unsafe recommendations for cancer treatment.

ML output should be correct in addition to being explainable. Reproducibility helps to achieve correctness through understanding and debugging.

3. Credibility

ML output must be credible. This is not just from a fairness and ethics viewpoint, but also because ML outputs sometimes impact lives (e.g. mortgage approval). Also, end-users of ML output expect answers to be verifiable, reliable, unbiased, and ethical. As Lecun said in his keynote at the International Solid State Circuit Conference in San Francisco, 2019:

Good results are not enough, Making them easily reproducible also makes them credible. - Lecun, ISSCC 2019

4. Extensibility

Reproducibility in preceding layers is needed in order to build on top of them and extend. Can we build a building-outline model if we can't repeatedly generate roof semantics, as shown in figure 5? What if we keep getting a different size for the same roof?

Figure 5: Extending ML

Extensibility is essential to making ML outputs consumable. Raw outputs from ML are rarely usable by end users as-is; most ML outputs need to be post-processed and augmented to be consumption-ready.

5. Data harvesting

The world’s most valuable resource is no longer oil, but data! - economist.com

Training a successful ML algorithm mostly requires a large dataset; this is especially true for deep learning. Obtaining large volumes of training data, however, is not always easy and can be quite expensive. In some cases, occurrences of the scenario of interest are so rare that obtaining a large dataset would either take forever or is simply not possible. For example, a dataset for Merkel-cell carcinoma, a very rare type of skin cancer, will be very challenging to procure.

For this reason, data harvesting, a.k.a. synthetic data generation, is considered. Tirthajyoti Sarkar, the author of Data Wrangling with Python: Creating actionable data from raw sources, wrote an excellent post on data harvesting using scikit-learn that covers this topic in detail. More recently, however, Generative Adversarial Networks (GANs), introduced by Ian Goodfellow, have been heavily used for this purpose. Synthetic Data for Deep Learning is an excellent review article that covers this topic in detail for deep learning.

Given that ML models (e.g. GANs) are now being used to generate training data, it's all the more important that reproducibility in such applications is ensured. Let's say we trained a near-perfect golden-goose model on data, including some synthetic data, but the storage caught proverbial fire and we lost this golden-goose model along with the data. Now we have to regenerate the synthetic data and obtain the same model, but the synthetic data generation process is not quite reproducible. Thus, we lost the golden goose!
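To make this concrete, here is a minimal sketch using scikit-learn's make_classification purely as a stand-in for whatever generator is actually in play; pinning the generator's seed is what makes the synthetic portion of the dataset regenerable:

```python
from sklearn.datasets import make_classification

# Hypothetical synthetic-data step: a fixed random_state means the exact
# same samples can be regenerated later if the stored copy is lost.
X_synth, y_synth = make_classification(
    n_samples=10_000,
    n_features=20,
    random_state=42,  # without this, every run yields a different dataset
)
```

The same principle carries over to a GAN-based generator: persisting the generator weights and the sampling seed is what allows the synthetic data, and hence the downstream model, to be rebuilt.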

Challenges in realizing reproducible ML

Reproducible ML does not come easy. A wise man once said:

When you want something, all the universe conspires in helping you to achieve it. - The Alchemist by Paulo Coelho

But when it comes to reproducible ML, it's quite the contrary. Every single resource and technique (hardware, software, algorithms, process & practice, data) needed to realize ML poses some kind of challenge to reproducibility (see figure 6).

Figure 6: Overview of challenges in reproducible ML

1. Hardware

ML algorithms are quite compute-hungry. The complex computation needed by ML operations nowadays runs in the order of giga/tera floating-point operations (GFLOPS/TFLOPS). Completing it in a reasonable time frame therefore requires high parallelism and multiple central processing units (CPUs), if not specialized hardware such as (general-purpose) graphics processing units (GPUs/GPGPUs), tensor processing units (TPUs), etc.

But these efficiencies in floating-point computation, at both the CPU and GPU level, come at the cost of reproducibility.

  • CPU

Using intra-op (within an operation) and inter-op (across multiple operations) parallelism on the CPU can sometimes give different results on each run. One such example is using OpenMP for intra-op parallelization. See Corden's excellent talk, "Consistency of Floating-Point Results or Why doesn't my application always give the same answer?" (Corden 2018), for more in-depth insight into this, and also the wandering precision blog post. A short illustration of the underlying floating-point issue follows at the end of this section.

  • GPU

General-purpose GPUs can perform vector operations thanks to their streaming multiprocessor (SM) units. The asynchronous computation performed by these units may produce different results on different runs. Floating-point multiply-add (FMAD) operations, or even reductions over floating-point operands, are examples of this. Some algorithms, e.g. vector normalization, can also be non-reproducible due to their reduction operations. See the reproducible summation paper for more info.

Changing the GPU architecture may lead to different results too; differences in the SMs or architecture-specific optimizations are a couple of reasons why such differences may arise.

See Corden's Consistency of Floating-Point Results or Why doesn’t my application always give the same answer for more details.
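Both the CPU and GPU cases above ultimately trace back to floating-point addition not being associative: the order in which partial sums are combined changes the result. Here is a tiny NumPy sketch of the effect (no parallel hardware required, only a different summation order):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

# Same numbers, two summation orders: one pass over the whole array versus
# eight chunk-wise partial sums combined afterwards (mimicking a split
# across threads or GPU thread blocks).
sequential = np.sum(values)
chunked = sum(np.sum(chunk) for chunk in np.array_split(values, 8))

# The two results typically differ in the last few bits because
# (a + b) + c != a + (b + c) in floating-point arithmetic.
print(sequential, chunked, sequential == chunked)
```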

2. Software

It's not just hardware. Some software offering high-level abstractions or APIs for intensive computation does not guarantee reproducibility in its routines. For instance, NVIDIA's popular CUDA-based deep learning library cuDNN does not guarantee reproducibility in some of its routines, e.g. cudnnConvolutionBackwardFilter (ref). Popular deep learning libraries such as TensorFlow (ref 1, ref 2) and PyTorch (ref) also do not guarantee 100% reproducibility.

There is an excellent talk on determinism in deep learning by Duncan Riach, maintainer of tensorflow-determinism, presented at NVIDIA's GPU Technology Conference (GTC) 2019.
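As a rough sketch of what such determinism tooling looks like in practice (assuming a TensorFlow 2.x release that honors this flag; the exact mechanism varies by version, and older releases rely on the tensorflow-determinism package's patch instead), deterministic GPU kernels can be requested before TensorFlow initializes:

```python
import os

# Assumption: TensorFlow 2.1+ reads this flag and, where a deterministic
# GPU kernel exists, selects it (usually at some cost in throughput).
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import tensorflow as tf  # import only after the flag is set
```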

Sometimes it's not just a trade-off for efficiency but a simple software bug that leads to non-reproducibility. One such example is a bug I ran into, which resulted in a different geo-location for the same computation after a certain library version was upgraded. This is clearly a software bug, but it underlines the fact that reproducibility goes beyond just computation and precision.

3. Algorithm

Several ML algorithms are non-reproducible by design because they rely on randomness; dropout layers and weight initialization are a few examples. Others can be non-deterministic because their underlying computational complexity calls for non-reproducible measures similar to the ones discussed in the software section; vector normalization and the backward pass are examples of these.

4. Process & Practice

ML loves randomness!

Figure 7: Randomness defined by xkcd

When things don't work with ML, we randomize (pun intended). We have randomness everywhere, from algorithms to process and practice, for instance:

- Random initializations
- Random augmentations
- Random noise introduction (adversarial robustness)
- Data shuffles

To ensure that randomness is seeded and can be reproduced (much like the earlier example of scikit-learn linear regression), a long list of seed-setting rituals needs to be performed in Python:

import os, random, numpy, tensorflow

seed = 42  # any fixed value will do
# Seed every source of randomness: Python, NumPy, and TensorFlow
os.environ['PYTHONHASHSEED'] = str(seed)  # ideally set before the interpreter starts
random.seed(seed)
numpy.random.seed(seed)
tensorflow.random.set_seed(seed)
# ...and every stochastic layer/op takes its own seed, for example:
tensorflow.keras.layers.Dropout(rate, seed=seed)
tensorflow.image.random_flip_left_right(x, seed=seed)
tensorflow.random_normal_initializer(mean, stddev, seed=seed)
# many such algorithmic layers as above

Can we ever win with this seed setting?

Figure 8: Seed setting (image credit: google)

5. Data

No input is ever really independent. [Sculley et al. 2015][scully_2015]

Data is the main input to ML algorithms, and these algorithms are not just compute-hungry but also data-hungry. So we are really talking about big data. When data volumes are large, we are dealing with all sorts of challenges:

- Data management
- Data provenance
- Data poisoning
- Under-represented data (inappropriate demographics)
- Over-represented data (inappropriate demographics)

One of the reasons ML is so iterative is that we need to evolve the ML algorithm with the data whilst also continuously evolving the data itself (e.g. data massaging, feature engineering, data augmentation). That's why data provenance is important, but it's equally important to maintain lineage from data provenance through to the ML processes. In short, end-to-end provenance is needed across ML processes.
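As a minimal, hypothetical sketch of what end-to-end provenance can look like in practice (the file names and manifest format here are illustrative assumptions, not a prescribed tool), one option is to record a content hash of every input next to the training artifacts:

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Return a SHA-256 content hash so the exact input data can be verified later."""
    digest = hashlib.sha256()
    with path.open('rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical inputs for a training run; record the manifest next to the model artifacts.
inputs = [Path('data/train.csv'), Path('data/labels.csv')]
manifest = {str(p): fingerprint(p) for p in inputs}
Path('run_manifest.json').write_text(json.dumps(manifest, indent=2))
```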

6. Concept drift

A model is rarely deployed twice. [Talby, 2018][Talby]

One of the reasons why a model rarely gets deployed more than once [ref][Talby] is concept drift. Our concept of things and stuff keeps evolving. Don't believe me? Figure 9 shows how we once envisaged the car versus how we see it now; our current, still-evolving impression is a solar-powered, self-driving car!

Figure 9: Our evolving concept of car
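Although drift detection is a topic of its own, a simple statistical check makes the idea concrete. The sketch below (the feature values and the threshold are illustrative assumptions) uses a two-sample Kolmogorov-Smirnov test to flag when live data no longer looks like the data the model was trained on:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical single feature: what the model was trained on versus what
# production traffic looks like a year later (the distribution has shifted).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.2, size=5_000)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"Feature has drifted (KS statistic={stat:.3f}); consider retraining.")
```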

So now we don't just have to manage reproducibility for one model but for many, because our model needs to keep learning continually - more commonly known in ML as continual learning (more info). An interesting review paper on this topic is here.

Figure 10: Top features - Dresner Advisory Services Data Science and Machine Learning Market Study

In fact, continual learning is so widely recognized that support for easy iteration and continuous improvement were the top two features industry respondents voted as their main focus for ML, as per Dresner Advisory Services' 6th annual 2019 Data Science and Machine Learning Market Study (see figure 10).


The next part of this technical blog series, [Reproducibility in Machine Learning], is Realizing reproducible Machine Learning - with TensorFlow.

[Reproducibility in Machine Learning]: /2019/12/20/Reproducibility-in-Machine-Learning.html

[scully_2015]: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

[Talby]: https://www.oreilly.com/radar/lessons-learned-turning-machine-learning-models-into-real-products-and-services/