1. Status of this document
This is a really unofficial draft. It’s not meant to capture any consensus, beyond my own personal feelings about what sounds interesting. It is provided for discussion only and may change at any moment, and should not be taken as "official" or even "unofficial, but planned". Its publication here does not imply endorsement of its contents by W3C or by Microsoft. Don’t cite this document other than as a collection of interesting ideas.
2. Introduction
Machine Learning (ML) algorithms have improved significantly in accuracy, reliability, and performance in recent years. While typically thought of as a technology for the cloud, machine learning has its applications on the device as well. Developing a machine learning model usually involves two stages: training and inference. In the first stage, the developer decides on a skeleton model and feeds a large dataset to the model in repeated iterations to train it. The model is then ported to a production environment to infer insights from real-time incoming data. While training is typically performed in the cloud, inference can occur in the cloud or on the device. Performing inference on the device has many desirable properties, including the performance boost of edge computing, resilience to poor or absent network connectivity, and security/privacy protection.
Although platforms for native applications have all shipped APIs to support machine learning inference on device, similar functionality has been missing from the web platform. Supporting such functionality could not only supercharge existing applications but also unlock new scenarios. For example, with the help of Service Worker, developers could make their text translation applications available offline. By inferring the user’s emotions from the user’s input (be it text, image, or video), developers can build rich emotional experiences. Applications on new frontiers such as Mixed Reality can become much "smarter."
Today, when web developers want to develop machine learning models, they face bottlenecks in memory, performance, and power consumption. Although various existing APIs ease the pain a little, a new set of APIs is necessary for unlocking ML on the web.
This explainer describes the use cases and developer interest that motivate the API, examines existing platform support, demonstrates a few known techniques for breaking the performance bottlenecks, and sketches out an initial set of requirements for the final API design. It is important to note that machine learning is a broad field. The explainer focuses on some areas (such as neural networks) I find particularly interesting, but there are other areas that haven’t been mentioned yet. The explainer is written to spark conversations about ML on the web; additions and corrections are welcome.
3. Terminology
- machine learning: A field of study that gives computers the ability to learn without being explicitly programmed, according to Arthur Samuel, who coined the term in 1959. This is in contrast with purpose-built software programs whose behavior logic is explicitly defined by developers.
- neural networks: A set of machine learning algorithms that take inspiration from the way brains operate. It is generally believed that the main computational unit of the brain is the neuron, and that networks of neurons enable brains to "compute."
- deep neural networks (DNNs): A subset of neural network algorithms that use multiple layers of neurons. The use of DNNs is behind several recent big breakthroughs in machine learning.
- training / train: Typically the first stage of developing a machine learning model. Developing machine learning applications typically involves two stages: training and inference. Training the network or model involves processing the data, feeding it to the network to determine the appropriate weights, and checking whether the accuracy of the model is sufficient. Once trained, the model should be sufficiently accurate for its predetermined purpose. Because training usually involves a very large dataset and many rounds of iteration, developers generally train the network in the cloud or on machines with high computing power.
- inference / infer: Typically the second stage of developing a machine learning model. At this stage, developers optimize their machine learning models for the production environment. Depending on the scenario, developers may accept a small drop in accuracy for the sake of speed or size.
- incremental learning: A possible follow-up stage after the initial model has been developed. Developers can use incremental learning techniques to improve the existing model.
- transfer learning: A technique that reuses what one model has learned to develop another model. For example, a model trained to recognize animals can be used as a starting point for a model that recognizes dogs.
4. Status Quo
4.1. Native Platforms
All native application platforms have shipped APIs to support machine learning and/or neural networks. For example, iOS and macOS shipped the Basic Neural Network Subroutines (BNNS) and updated the Accelerate framework for Core ML. The Universal Windows Platform (UWP) has added support for CNTK. Android is also said to be releasing a deep neural network API soon.
Platforms and developers have also built extensive frameworks on top of these APIs for mobile scenarios. Examples include Facebook’s Caffe2go, Google’s TensorFlow Lite, Apple’s Core ML framework, and CNTK’s support for UWP.
4.2. Web Developer Interests
The web development community has shown strong interest in machine learning by creating libraries and frameworks that simplify neural network development. For example, Andrej Karpathy developed the ConvNetJS library, which can be used to build convolutional networks, deep neural networks, and more. Another example is the Synaptic.js library, which implements a recurrent neural network architecture. Comprehensive lists of existing libraries are available elsewhere.
Although the above libraries already cover many use cases, they do suffer from performance and memory bottlenecks. The keras.js library sought to address the problem by using WebGL in a clever way. Because WebGL can operate on data directly in GPU memory, performance does show significant improvement in lab settings. However, because WebGL cannot be accessed from Web Workers, it can be difficult for production sites to adopt. More limitations of WebGL are discussed in the section below.
5. Use Cases
Developers may use machine learning for a variety of purposes. Drawing inspiration from existing demos and production sites/apps, this section illustrates a few sample use cases. As mentioned above, this document is meant to inspire discussion in this space among browser vendors and the web development community.
5.1. Offline Recommendation Engine
A web application built with Service Worker to be resilient to network conditions may wish to run its recommendation engine offline. For example, a site serving images/GIFs/videos as content may wish to serve users a smart content feed built from content cached with Service Worker. Or a productivity application with many different features, like Office, may wish to provide Help for users looking for the right feature while the user is offline or traveling with a poor network connection.
5.2. Text Translation
A web application may wish to translate text from one language to another offline. The Google Translate service trained a machine learning model to translate between languages and ported the model to its mobile app. The mobile app can be used offline, though translation quality may be better online.
5.3. Object Detection from Images/Videos
A web application may wish to recognize objects in images or videos. For example, Baidu built convolutional neural networks (CNNs) into its mobile app so that the app can detect the main object in the live camera feed and search for related merchandise based on the result (Baidu deep learning framework).
In addition to generic object recognition, an application may wish to train its model to focus on a few classes of objects for more specific classification. For example, an application may want to let users enter their credit card number via a live camera feed; a generic text detection model may be far less accurate than a model trained only on credit card numbers. Or a web application for streaming/uploading videos may wish to perform live checks of the camera feed to ensure the user isn’t showing obscene content, for law compliance purposes. Or a web application may allow users to screen themselves for likely skin cancers with a live camera feed (Esteva et al., 2017).
An application may also wish to let the front-end code do identification and leave the task of classification to the back-end. Object detection usually consists of two stages: identification and classification. The detection model first identifies the objects in an image and then classifies them, for example determining whether an object is an animal. In the skin cancer example above, the application may wish to let the front-end code identify the mole and leave the task of classifying whether it is cancerous to the back-end.
5.4. Risk Analysis
A web application may wish to deploy a small-scale risk analysis model to determine whether a transaction should be pre-approved, leaving the final decision to the full-scale risk models on the back-end. Quick pre-approval improves the user experience while reducing the cost of running the full model.
5.5. Rich Interactive Experience
A web application may wish to detect the user’s emotion based on their input and dynamically adapt the interaction model. For example, a social media site may wish to detect the user’s emotion as they type a post and recommend the right emoji to use. If the user wishes to post a picture alongside the post, the application can also make recommendations based on the post’s content.
5.6. Mixed Reality Experience
A web application built for mixed reality platforms may wish to leverage machine learning to anticipate the user’s intention and movement in order to render content intelligently.
6. Related Research
The design of an appropriate API surface for machine learning inference should incorporate learnings from research on optimizing machine learning models to run on devices with low computational power, such as IoT devices. This section covers a few sample techniques for inspiration: quantization, Huffman coding, discretization, and sparse matrices.
A common theme among these techniques is that they all trade accuracy for other qualities, such as size, speed, or power consumption.
6.1. Quantization
Quantization refers to a group of techniques that convert the high-precision floating point numbers typically used in the training phase to low-precision, compact-format numbers. Doing so reduces file size and accelerates computation. The technique is particularly useful for DNNs.
During the training stage, programs typically compute with high-precision floating point numbers. That is because the biggest challenge in training is to get the models to work, and floating point is best at preserving accuracy: training a neural network essentially means tweaking the weights of the network until a satisfactory result is obtained. Moreover, developers usually have access to many GPUs during training, and GPUs work very well with floating point numbers; using them allows training to run much faster, so as not to waste development time.
During inference, the main challenge becomes shrinking the file size. As it turns out, converting 32-bit numbers into 8-bit numbers shrinks the file size and memory throughput by a factor of four. The same goes for caches and SIMD instructions. Because many machine learning algorithms are now well-equipped to handle statistical noise, reducing precision often doesn’t lead to much loss of accuracy. And although low precision may not matter much for GPUs, it can matter a lot for DSPs, which are usually designed to operate on 8-bit numbers; nowadays most computers, including smartphones, come with DSPs.
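To make the trade-off concrete, here is a minimal sketch of linear 8-bit quantization in TypeScript. The scale-and-offset scheme shown is one common approach; the function names and layout are illustrative, not taken from any particular library.

// Linearly quantize 32-bit floats into 8-bit integers. Each value is
// mapped onto [0, 255] using a per-tensor scale and offset, shrinking
// storage by a factor of four.
function quantize(weights: Float32Array): { data: Uint8Array; min: number; scale: number } {
  let min = Infinity, max = -Infinity;
  for (const w of weights) {
    if (w < min) min = w;
    if (w > max) max = w;
  }
  const scale = (max - min) / 255 || 1; // avoid division by zero
  const data = new Uint8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    data[i] = Math.round((weights[i] - min) / scale);
  }
  return { data, min, scale };
}

// Recover approximate floats at inference time.
function dequantize(q: { data: Uint8Array; min: number; scale: number }): Float32Array {
  const out = new Float32Array(q.data.length);
  for (let i = 0; i < q.data.length; i++) {
    out[i] = q.data[i] * q.scale + q.min;
  }
  return out;
}

The dequantized values only approximate the originals; the error is bounded by half a quantization step, which well-conditioned models tolerate as statistical noise.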
6.2. Huffman Coding
Huffman coding is a commonly used compression algorithm that uses variable-length codewords to encode symbols. Studies suggest Huffman coding can usually shrink network file size by about 20% to 30%. The technique can be used after quantization to reduce size further.
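As an illustrative sketch only (not a production encoder), the following TypeScript builds Huffman codewords for a buffer of quantized 8-bit weights. Because quantized weights cluster around a few values, the most frequent values receive the shortest codewords, which is where the savings come from.

interface HuffNode { weight: number; symbol?: number; left?: HuffNode; right?: HuffNode; }

// Build a Huffman tree by repeatedly merging the two least-frequent
// nodes, then walk it to assign a bit string to every symbol.
function buildHuffmanCodes(data: Uint8Array): Map<number, string> {
  const freq = new Map<number, number>();
  for (const b of data) freq.set(b, (freq.get(b) ?? 0) + 1);

  let nodes: HuffNode[] = [...freq].map(([symbol, weight]) => ({ symbol, weight }));
  const codes = new Map<number, string>();
  if (nodes.length === 0) return codes;

  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight); // a real encoder would use a heap
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ weight: left.weight + right.weight, left, right });
  }

  const walk = (node: HuffNode, prefix: string): void => {
    if (node.symbol !== undefined) { codes.set(node.symbol, prefix || "0"); return; }
    walk(node.left!, prefix + "0");
    walk(node.right!, prefix + "1");
  };
  walk(nodes[0], "");
  return codes;
}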
6.3. Discretization
Discretization is the process of transferring continuous functions to discrete numbers. Some may argue that quantization is a form of discretization. One thing to call out about this technique is that it really helps decrease power consumption.
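A minimal sketch of the idea, with illustrative names: snapping a continuous activation onto one of k evenly spaced levels. Computing over a small, fixed set of levels requires far simpler arithmetic than full floating point, which is one reason discretized computation can decrease power consumption.

// Discretize a continuous value into one of k evenly spaced levels
// between lo and hi.
function discretize(x: number, lo: number, hi: number, k: number): number {
  const clamped = Math.min(Math.max(x, lo), hi);
  const step = (hi - lo) / (k - 1);
  return lo + Math.round((clamped - lo) / step) * step;
}

discretize(0.37, -1, 1, 5); // snaps to 0.5 (levels: -1, -0.5, 0, 0.5, 1)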
6.4. Sparse Matrix
Most machine learning problems don’t involve a densely populated matrix. Adopting sparse matrix data structures, and numerical methods specialized for those data structures, can significantly reduce memory usage.
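For example, the compressed sparse row (CSR) layout, sketched below in TypeScript with illustrative types, stores only the nonzero entries plus two index arrays, and lets operations such as matrix-vector products skip the zeros entirely.

// Compressed Sparse Row (CSR): store only the nonzero values, their
// column indices, and one row-pointer entry per row, instead of a
// full rows x cols buffer.
interface CSRMatrix {
  values: Float32Array;     // nonzero entries, row by row
  colIndices: Uint32Array;  // column of each nonzero entry
  rowPointers: Uint32Array; // values[rowPointers[r]..rowPointers[r+1]) belong to row r
  cols: number;
}

// Sparse matrix-vector product y = A * x, touching only the nonzeros.
function spmv(a: CSRMatrix, x: Float32Array): Float32Array {
  const rows = a.rowPointers.length - 1;
  const y = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    for (let i = a.rowPointers[r]; i < a.rowPointers[r + 1]; i++) {
      y[r] += a.values[i] * x[a.colIndices[i]];
    }
  }
  return y;
}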
7. Existing Standard APIs
Today several standard APIs do exist to help developers make use of machine learning technologies in their web applications. For example, the Web Speech API lets developers use bidirectional speech-to-text conversion technology. The WebGL API gives developers a path to leverage the GPU for matrix computation, though the path is not as straightforward as one might hope. The same goes for the WebGPU API proposal put forth by WebKit earlier this year. WebAssembly lets developers compile their existing trained networks written in C++ to binaries that can run in the browser.
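For instance, the speech half of the story is already usable with a few lines of script; the recognition interface is still vendor-prefixed in some browsers, hence the defensive lookup in this sketch.

// Text-to-speech with the Web Speech API.
const utterance = new SpeechSynthesisUtterance("Hello from the web!");
speechSynthesis.speak(utterance);

// Speech-to-text; prefixed in some browsers, so look up both names.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionCtor();
recognition.onresult = (event: any) => {
  console.log("Heard:", event.results[0][0].transcript);
};
recognition.start();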
However, none of the existing standards was created with machine learning on the web in mind, and therefore none provides sufficient support for machine learning on the platform. Although each of them helps with a subset of the problems, a generic solution is needed.
7.1. APIs That Rely on Machine Learning Technologies
In the past few years, we have added support for a few new APIs that rely on machine learning technologies. The Web Speech API enables developers to easily convert text content to speech and speech content to text. Both features are possible because of advancements in natural language processing, a sub-field of machine learning. The Web Authentication API enables web developers to authenticate users with strong authenticators, such as fingerprint scanners and facial recognition systems. Biometric authenticators all employ machine learning technologies in one way or another. The Shape Detection API, a recent addition to the Web Incubator CG, allows developers to detect faces, barcodes, and text in live or still images. Object detection technologies are often based on research in machine learning, which in turn furthered research in Image Signal Processors (ISPs).
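As a sketch of how the Shape Detection API surfaces machine learning to developers (the API is still incubating and may be behind flags, hence the untyped access):

// Face detection with the incubating Shape Detection API.
const faceDetector = new (window as any).FaceDetector();
const image = document.querySelector("img")!;
faceDetector.detect(image).then((faces: any[]) => {
  for (const face of faces) {
    console.log("Face bounding box:", face.boundingBox);
  }
});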
A common motivation behind building the above APIs is that machine learning models are computationally expensive. Yet continuing to add APIs to the platform solely to absorb that computational cost does not scale. There should be a generic solution that brings down the computational cost of doing machine learning on the web platform.
7.2. WebGL
The WebGL API was designed to render 3D and 2D graphic content, making use of GPUs behind the scenes when necessary. Given that most graphics processing relies on matrix computation, web developers have developed libraries that wrap WebGL to accelerate matrix computation. However, as illustrated below, such libraries are not developer-friendly and are often very taxing on memory. Take the example of this matrix multiplication method. The method first has to instantiate two RGBA texel arrays, transpose one of them, create two input textures and one output texture, activate the shader, bind the input textures, set shader parameters, bind the output texture, and finally call drawElements to calculate the matrix. After the calculation, it also has to unbind all the textures. A simple matrix multiplication should only need to instantiate one new matrix in memory instead of five (two arrays and three textures).
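The overhead is easier to see in code. The following condensed sketch mirrors the sequence of steps above; the GLSL shader source, its compilation, and the vertex setup are omitted, and matMulProgram stands in for a precompiled shader program.

// Condensed sketch of a WebGL-based matrix multiply (shader source,
// compilation, and vertex setup omitted for brevity).
function matMul(gl: WebGLRenderingContext, matMulProgram: WebGLProgram,
                a: Float32Array, b: Float32Array, size: number): Uint8Array {
  // 1. Pack both matrices into RGBA texel arrays (extra copies in memory).
  const texelsA = new Uint8Array(a.buffer.slice(0));
  const texelsB = new Uint8Array(b.buffer.slice(0)); // would also be transposed here

  // 2. Create two input textures and one output texture.
  const texA = gl.createTexture(), texB = gl.createTexture(), texOut = gl.createTexture();
  for (const [tex, texels] of [[texA, texelsA], [texB, texelsB], [texOut, null]] as const) {
    gl.bindTexture(gl.TEXTURE_2D, tex);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, size, size, 0, gl.RGBA, gl.UNSIGNED_BYTE, texels);
  }

  // 3. Activate the shader and bind the input textures to texture units.
  gl.useProgram(matMulProgram);
  gl.activeTexture(gl.TEXTURE0);
  gl.bindTexture(gl.TEXTURE_2D, texA);
  gl.activeTexture(gl.TEXTURE1);
  gl.bindTexture(gl.TEXTURE_2D, texB);

  // 4. Bind the output texture to a framebuffer and draw to "compute".
  const fb = gl.createFramebuffer();
  gl.bindFramebuffer(gl.FRAMEBUFFER, fb);
  gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, texOut, 0);
  gl.drawElements(gl.TRIANGLES, 6, gl.UNSIGNED_SHORT, 0);

  // 5. Read the result back and unbind everything.
  const out = new Uint8Array(size * size * 4);
  gl.readPixels(0, 0, size, size, gl.RGBA, gl.UNSIGNED_BYTE, out);
  gl.bindTexture(gl.TEXTURE_2D, null);
  gl.bindFramebuffer(gl.FRAMEBUFFER, null);
  return out;
}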
Although the next generation of the WebGL API could include more support for direct mathematical computation, one can argue that this goal is not within the charter of an API designed for drawing graphics. In addition, the next WebGL (WebGL 3.0) is still far away, given that Chrome and Firefox just implemented support for version 2.0 earlier this year.
7.3. Web Assembly
WebAssembly is a new low-level, assembly-like language with a compact binary format and near-native performance. Programs written in C/C++ can be compiled directly to this format to run on the web. In browsers, WebAssembly programs run in a sandbox and can be used alongside JavaScript.
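Loading such a compiled routine from JavaScript is straightforward; in the sketch below, matmul.wasm and its exported matmul function are placeholders for whatever the developer’s C++ toolchain produces.

// Fetch, compile, and instantiate a WebAssembly module, then hand
// back its exported matrix-multiply routine.
async function loadMatMul(): Promise<(a: number, b: number, out: number, n: number) => void> {
  const bytes = await (await fetch("matmul.wasm")).arrayBuffer();
  const { instance } = await WebAssembly.instantiate(bytes, { env: {} });
  // The arguments are offsets into the module's linear memory, which
  // the caller must fill with matrix data before calling.
  return instance.exports.matmul as (a: number, b: number, out: number, n: number) => void;
}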
As previously stated, systemic support for machine learning programs should aim to let programs use the least memory necessary, provide the best possible performance, and preferably ease the pain of importing trained models to the web. Mainstream machine learning frameworks can usually produce models in C++ format. Given these three goals, WebAssembly seems like a fitting solution for ML.
However, the current WebAssembly design does have a few shortcomings when applied to ML. First, WebAssembly does not have GPU support, a well-known performance accelerator for ML. Second, WebAssembly lacks support for running in Web Workers. Because ML models can take up several hundred megabytes, with unpredictable peaks, developers should be discouraged from running models on the UI thread. Third, network bottlenecks are often the motivation for doing ML computation on the client in the first place, and common matrix functions can be large in size. Because WebAssembly starts from a blank slate, developers have to load related libraries themselves; for example, they would have to define their own matrix data types. If such libraries were built into the platform, much less would need to be downloaded.
7.4. WebGPU
The WebGPU API is a new incubating API that aims to expose modern GPU features. Its initial API set is derived from the Metal language. A prototype of the API has landed in WebKit.
Although the API aims to expose low-level GPU functionality, its initial API set is primarily geared toward graphics rendering, not direct mathematical computation. Research has also shown that while GPUs accelerate computing, chips can be designed in ways that make them much better at machine learning computation. For example, quantization, a common technique that shrinks numbers to less-than-32-bit representations, has proven to be an efficient way to reduce the size of programs. Companies have produced chips designed for machine learning on personal devices instead of using GPUs, such as Movidius’ (an Intel company) Myriad VPU, IBM’s TrueNorth chips, and Intel’s Nervana. If the aim of the WebGPU API is to expose an interface for modern GPUs, it would not by itself be suitable for the machine learning field.
8. Requirements
8.1. Challenges
Performing machine learning inference on the web faces challenges in several areas: file size, memory, performance, and power consumption. First, machine learning models typically have a large file size: the streamlined mobile version of TensorFlow by itself takes around 10 megabytes, and a full model would add at least another 10 megabytes. Second, ML programs, especially deep neural networks, often instantiate many matrices throughout the computation cycle and may exceed memory limits. Third, as illustrated above, the current web platform has poor support for the kind of mathematical computation that machine learning programs typically use. Finally, if a model were run inside a web application today, it could consume significant power.
8.2. Requirements
The web platform should therefore offer APIs that aim at the goals below:
- Provide APIs that are generic and basic enough to cover a large range of machine learning algorithms, including deep neural networks, decision trees, etc.
- Optimize toward small file size and low memory consumption.
- Make use of available hardware acceleration, such as ASICs/GPUs/CPUs/DSPs, and parallelization to maximize performance gains.
- Balance hardware acceleration against the level of abstraction needed for the web platform.
- Be designed with a focus on inference instead of training, for example by offering an easy way to port trained models into inference-ready models.
9. Strawman Proposal
This section includes a strawman proposal for the API surface as a starting point for the conversation.
[Exposed=Window, SecureContext]
interface Matrix {
  readonly attribute ArrayBuffer rawMatrix;
  readonly attribute unsigned long height;
  readonly attribute unsigned long width;
  readonly attribute boolean isTransposed;
  readonly attribute boolean isVector;
  readonly attribute int bitSize; // Represents which numerical representation is used

  Promise<Matrix> fromJSON(String str);
  Promise<Matrix> fromArray(ArrayBuffer buf);
  Promise<Matrix> multiply(Matrix a, Matrix b); // Multiply two matrices
  Promise<Matrix> div(Matrix a, Matrix b);      // Return matrix of element-wise division
  Promise<Matrix> mod(Matrix a, Matrix b);      // Return element-wise remainder of division
  Promise<Matrix> sum(Matrix a, Matrix b);      // Return matrix of sum of elements of both matrices
  Promise<Matrix> subtract(Matrix a, Matrix b); // Return element-wise subtraction
  Promise<Matrix> square(Matrix a);             // Return element-wise square
  Promise<Matrix> sqrt(Matrix a);               // Return element-wise square root
  Promise<Matrix> mean(Matrix a);               // Return mean of elements across dimensions
  Promise<Matrix> min(Matrix a);                // Return minimum of elements across dimensions
  Promise<Matrix> max(Matrix a);                // Return maximum of elements across dimensions
  Promise<float> selectFloat(Matrix a, optional unsigned long x, optional unsigned long y);
  Promise<unsigned long> selectLong(Matrix a, optional unsigned long x, optional unsigned long y);
  Promise<String> toJSON(Matrix a);
};

interface SparseMat : Matrix {
  // to do
};

interface DenseMat : Matrix {};

interface SparseVec : Matrix {};

interface DenseVec : Matrix {};
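To illustrate how the strawman might feel in practice, here is a hypothetical usage sketch in TypeScript. Nothing below is implemented anywhere; construction is left unspecified by the IDL, so fromArray is treated as a static factory for readability, and the buffers and shapes are made up.

// Hypothetical usage of the strawman Matrix interface above.
const bufferA = new ArrayBuffer(4 * 4 * 4); // 4x4 matrix of 32-bit weights
const bufferB = new ArrayBuffer(4 * 4 * 4); // 4x4 matrix of 32-bit inputs

const a = await Matrix.fromArray(bufferA);  // treated as a static factory
const b = await Matrix.fromArray(bufferB);

const product = await a.multiply(a, b);     // per the IDL, operands are explicit
const top = await product.max(product);     // reduce across dimensions
console.log(await top.selectFloat(top, 0, 0));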