I have to be honest, I don't think this is a good explanation. I don't know what differentiable programming is, but I'm fairly sure I have the mathematical background to understand it. But I didn't come away from this article with any confidence that I'm following along.
On a superficial level it seems like it:
1. Generalizes deep learning to an optimization function on decomposable input, and
2. Reduces the number of parameters required to learn the input by exploiting the structure of the input, thereby making learning more efficient.
Is that correct? Is it completely off? What am I missing? Is there any more meat to the article than this?
Could someone who has upvoted this (and ideally understands the topic well) provide a different explanation of the concept? It would be great if I could see a real world example (even a relatively trivial one) represented in both the traditional matrix computation form and the sexy new differentiable form.
Let me try to give a high-level explanation. As we have had success using neural networks for different tasks, people began to ask what properties make them work so well.
By "work well" I mean a family of models that is both easy to describe and easy to train. Nearly all neural network models fit this bill, as they can be described with a loss function: something that scores how accurately the network performs with a particular set of parameters. This loss function is usually differentiable, and thanks to autodifferentiation tools you can obtain its derivative with respect to those parameters. With that derivative in hand, you can run SGD to find good parameters. We now have both empirical and theoretical evidence that SGD works fairly well for optimizing functions, particularly the high-dimensional ones neural networks describe.
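To make that concrete, here's a minimal sketch of the loss-plus-autodiff-plus-SGD recipe in PyTorch (the data and model are made up for illustration):

```python
import torch

# Made-up data: a noisy 1-D linear relationship.
x = torch.randn(100, 1)
y = 3.0 * x + 1.0 + 0.1 * torch.randn(100, 1)

# Parameters to learn, flagged for autodifferentiation.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for step in range(500):
    # The loss function scores the current parameters.
    loss = ((x * w + b - y) ** 2).mean()
    loss.backward()            # autodiff: d(loss)/dw, d(loss)/db
    with torch.no_grad():      # a plain SGD step on those gradients
        w -= 0.1 * w.grad
        b -= 0.1 * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())      # should approach 3.0 and 1.0
```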
So the insight is that we need a name for all these models which are differentiable and, because of that, nice to train. Now we can say that what really ties together CNNs, MLPs, RNNs, etc. is not some biological metaphor, but that they are all expressible as a loss function whose minimum we can find using SGD.
Supervised learning is a classic inverse problem -- you know the desired outcome, but not how to get there. (think control systems)
Suppose you had a flexible model in whose parameter space you could search. A naive algorithmic approach would be to treat this as a brute-force search problem. Maybe you can be slightly more clever algorithmically (better search algorithms, heuristics, etc.).
But can you somehow be really clever and exploit the fact that the parameter space is continuous, with a smooth cost function? That's what gradient descent basically is. Lots of bells and whistles on top, to exploit extra problem structure.
Now, let's go back to our family of models. They just need to be transparent to gradient information -- then you can use simple ideas at the heart of adaptive control theory. Suddenly, the fact that autodifferentiation techniques can differentiate arbitrary computational algorithms (a sibling comment explains how compositionality helps this) sounds mouth-watering! That's what makes a lot of people excited -- your model family just expanded to programs performing arbitrary computations on continuous parameters (including loops!) -- not just some silly function approximator like a neural network (exaggerating to make my point)
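As a toy illustration of that last point (my own sketch, in PyTorch): autodiff happily differentiates through an ordinary loop, because the tape only records the arithmetic that actually executed:

```python
import torch

def simulate(k, n_steps=20):
    # A toy iterative program: repeatedly apply x <- x + k * (1 - x).
    x = torch.tensor(0.0)
    for _ in range(n_steps):   # an ordinary Python loop
        x = x + k * (1.0 - x)
    return x

k = torch.tensor(0.3, requires_grad=True)
out = simulate(k)
out.backward()                 # gradient of the whole loop w.r.t. k
print(out.item(), k.grad.item())
```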
Last I checked, differentiable programming involved matrix sizes exploding as 2^n in the number of memory bytes the differentiable program could access. Is the situation any better now? If not, it seems kind of like training a function approximator to map the present cycle's memory and register state to the (i+1)-th state.
How does auto-differentiation apply to loop indices? Is this really any more computationally powerful than techniques used to solve integer linear programmes?
In that regard it’s mostly in the first few paragraphs, but it’s subtly stated...
If you have a model and some parameter space on it you’re trying to optimise, then you can’t do much other than search the whole parameter space or approximate such a search intelligently.
However, if your model is differentiable with respect to its parameters then it means you’ve got gradient... and if you’ve got gradient then you can cut out a lot of searching, because you can use gradient ascent and other such techniques.
So basically: differentiable means can be differentiated which means has gradient which means search less.
The rest of the article basically says that’d be nice because there is a way to think about differentiability modularly and therefore there may be a route to Lego things together from across fields to make uber models which still optimise well but use specialist knowledge.
The parameter space stuff just means that less built-in knowledge means more data and more parameters, which is bad.
The basic idea is to have an expressive programming language where all constructs are differentiable. Since the composition of diffeomorphisms is a diffeomorphism, large programs (like ray-tracers) will be differentiable as a result.
I don't think you mean to use a term as strong as diffeomorphism. A diffeomorphism is differentiable, sure, but it is also invertible with a differentiable inverse. Diffeomorphisms have a lot of properties that machine learning honestly does not want, like preservation of dimension.
> The basic idea is to have an expressive programming language where all constructs are differentiable. Since the composition of diffeomorphisms is a diffeomorphism, (...)
No. It has nothing to do with diffeomorphisms (which are necessarily between spaces of the same dimension), but with piecewise differentiable functions.
Wow, I wish you'd written the article. Thank you. I can only imagine what you'd be able to explain if you had as much space as the author...
It sounds like you're just describing an evolution of functional programming, where each area of computation in the program is contextually generalizable because its set of inputs and outputs is smooth and differentiable.
One followup question: are you sure the maps need to be (or are generally intended to be) isomorphic? That strikes me as very limiting. I follow your point about compositions of diffeomorphisms being diffeomorphisms themselves, but do you need that invertibility to make this paradigm work?
Differentiable programming is not tied to functional programming. The Python library PyTorch enables differentiable programming, since you can write almost arbitrary Python code and differentiate it, including if-then control flow, allowing you to use gradient descent to optimize the parameters of your model.
Traditionally deep learning just meant a sequence of functions applied compositionally (hence the "deep") where each function (termed a layer) is a matrix multiply followed by some well-behaved non-linear function ("activation function"). These were differentiable by design and optimized using gradient descent.
But now the models we want to build are structurally more complex than this merely sequential composition of functions. We want to be able to use control flow, accept multiple inputs, return multiple outputs, etc., but we still want the model to be differentiable so we can use an iterative optimization procedure like gradient descent. This extension from what deep learning traditionally meant (a fairly restrictive class of sequential function compositions) to complex, branching models is what is now termed differentiable programming.
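A minimal sketch of what that looks like in PyTorch (my own toy example, not from the article): data-dependent branching, where only the branch that actually ran receives gradient:

```python
import torch

def model(x, a, b):
    # Data-dependent control flow: only the taken branch is taped.
    if x.item() > 0:
        return a * x
    return b * x * x

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

y = model(torch.tensor(3.0), a, b)
y.backward()
print(a.grad, b.grad)   # tensor(3.) and None: the untaken branch got nothing
```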
For a concrete example, our recent paper RenderNet learns end-to-end differentiable ray-tracing operations like ambient occlusion and normal maps, which can be used in Phong shading.
> Since the composition of diffeomorphisms is a diffeomorphism, large programs (like ray-tracers) will be differentiable as a result.
That's just bs. If you just read the ray-tracer article, it clearly states that f(scene parameters) -> image is not differentiable, simply because of the sharp edges of the objects being rendered. Also, the word `diffeomorphism` makes no sense at all in this context.
> Could someone (...) provide a different explanation of the concept?
Yes. It has nothing to do with deep learning, and nothing to do with optimization (although it is useful for both). It is just a way to write your functions so that they are easy to differentiate.
Differentiable does not mean easy to optimize.
One could imagine implementing sha-256 using differentiable operators, and yet the system as a whole would not be optimizable at all.
It would be interesting to have compilers that optimize the "optimizability" of differentiable programs tho...
Also, here are two interesting examples of differentiation through physical systems for classification:
Could someone list some practical examples where Differentiable Programming would be useful?
I am familiar with where Neural Networks and Convolutional Networks have done well, especially around image processing etc.
But I can’t imagine where having differential code would help unless it is just tying multiple neural networks together in a continuous chain of differentiation.
For most programming tasks, I can’t imagine how differentiation would be possible or beneficial.
Is there a possibility that one could start with a series of unit tests and partial results and through gradient descent actually arrive at additional passing test cases? Most of the time in my experience, passing additional test cases like this requires significantly more complex structures that would not be found via differentiation.
There are examples quite far from neural networks, the ones I can think of are broadly optimisation problems:
Many physics problems involve trying to find a function which minimises something -- energy, entropy, action. Or the state which makes the difference between two things zero. Sometimes adjusting many parameters slowly down the gradient is a good way to find these.
In Bayesian statistics, the basic problem is to sample from a distribution which you know only indirectly, by some kind of Monte Carlo method. But the space to sample can be enormous. If I understand right, advanced ways of doing this exploit the gradients (of the functions defining the distribution) to try to choose samples efficiently.
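For a flavor of that, here's a crude sketch of gradient-informed sampling (unadjusted Langevin dynamics, in PyTorch; real samplers like HMC add a lot more machinery, and the target density here is made up):

```python
import torch

def log_prob(x):
    # Unnormalized log-density of a standard 2-D Gaussian (toy target).
    return -0.5 * (x ** 2).sum()

x = torch.zeros(2, requires_grad=True)
eps = 0.05
samples = []
for _ in range(1000):
    grad, = torch.autograd.grad(log_prob(x), x)
    with torch.no_grad():
        # Drift up the gradient of log p, plus exploration noise.
        x += 0.5 * eps * grad + eps ** 0.5 * torch.randn(2)
    samples.append(x.detach().clone())
```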
People have hacked TensorFlow to do all sorts of things its creators didn't intend, or written tools specialised for another particular domain (like Stan). I guess the excitement is that instead of re-inventing the wheel in each domain, maybe this can be pushed down to become a language feature which everyone above uses.
For instance, in robotics the inverse kinematics could be formulated using the well-known equations, combined with a CNN for vision, and maybe reinforcement learning done across the whole thing. Or in audio classification, one could use standard filterbank construction for creating a time-frequency (spectrogram) representation, then combine it with a CNN for classification.
Right now these things are done either:
1) Unstructured end-to-end learning, where a deep neural network has to discover the laws of physics unaided by structure. The models are massively over-parametrized, require a lot of data to train, and are vulnerable to adversarial inputs.
2) As independent systems, optimized separately. If the first model is complex, the model that follows must usually reflect this complexity. A simpler global solution might exist, but cannot be found.
It is however super early days. Right now there are hints that going in this direction might be fruitful, but as far as I know, not many concrete wins or a lot of practice. That will take some years still.
In quantitative finance automatic differentiation will give you sensitivities (which tell you how the value changes when interest rates and underlying prices and other factors change—very important for risk and hedging) for your products in a Monte Carlo simulation. And even better, you will get it for all times in the simulation.
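A toy sketch of that idea (a Monte Carlo pricer for a European call under geometric Brownian motion; the numbers are made up, and a real pricer would be far more involved). Autodiff hands you the delta with no extra code:

```python
import math
import torch

S0 = torch.tensor(100.0, requires_grad=True)   # spot price
K, r, sigma, T = 105.0, 0.01, 0.2, 1.0
n_paths = 100_000

# Simulate terminal prices and discount the call payoff.
z = torch.randn(n_paths)
ST = S0 * torch.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
price = math.exp(-r * T) * torch.clamp(ST - K, min=0.0).mean()

price.backward()
print(price.item(), S0.grad.item())   # price and its delta, dV/dS0
```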
Differentiable programming is a generalization of deep learning, so all the same practical examples from deep learning apply. The difference, however, is that in DP you can write your "models" simply as functions. This means that in an ideal DP framework, the code you write needs no special types and no special syntax. This would be particularly useful if you're using some other library's code which was not written with DP/DL in mind at all (think ODE solvers, optimization routines, etc.); that code can be added to your "model" (again, just a regular function), and it will just work!
The difficulties of course lie in implementing such a framework, verifying whether your program is truly differentiable, and so on. The author of this blog post has written a working prototype framework in Julia called Zygote.jl. In short, it achieves DP through metaprogramming, applying source code transformation techniques at compile time, and it has already been quite successful.
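To illustrate the "it will just work" claim, here's a sketch in PyTorch rather than Zygote (same idea, different framework): a plain forward-Euler ODE solver with no AD-specific code in it, differentiated end to end:

```python
import torch

def euler_solve(f, x0, t1, n=100):
    # An ordinary ODE solver; nothing in it knows about autodiff.
    x, dt = x0, t1 / n
    for _ in range(n):
        x = x + dt * f(x)
    return x

theta = torch.tensor(0.5, requires_grad=True)
x1 = euler_solve(lambda x: -theta * x, torch.tensor(1.0), t1=2.0)
x1.backward()                     # d(solution)/d(theta), for free
print(x1.item(), theta.grad.item())
```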
The first thought off the top of my head is the giant set of default parameters that sits on top of pretty much every program, set by intuition by the original coders and then never touched again. Define a loss function, and you can start to optimize those parameters "for free".
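A hypothetical sketch of that (in PyTorch; the "program" and its constants are stand-ins): expose the hand-tuned constants as learnable parameters and let an optimizer adjust them against recorded examples:

```python
import torch

# Two hand-tuned constants, now exposed as learnable parameters.
smoothing = torch.tensor(0.5, requires_grad=True)
threshold = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([smoothing, threshold], lr=0.01)

def program(x):
    # Stand-in for the real program; it just has to be
    # differentiable in the parameters being tuned.
    return torch.sigmoid((x - threshold) / smoothing)

# Assumed: (input, desired output) pairs that define the loss.
examples = [(torch.tensor(0.8), torch.tensor(0.2)),
            (torch.tensor(1.4), torch.tensor(0.9))]
for _ in range(200):
    for x, target in examples:
        loss = (program(x) - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```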
Something I don't understand about Automatic Differentiation is: Why not use a Computer Algebra System instead for generating derivatives of given functions?
That's essentially what a source-to-source AD does, just with support for the extra features that show up in programming languages. For example, handling variable bindings gets you the typical Wengert list, and handling function calls gets you the Pearlmutter and Siskind style backpropagator (I wrote a bit about the relationships at [0]).
The short answer is that CAS systems work with a "programming language" that doesn't have these features and is therefore a bit too limited for the kinds of models we're interested in.
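To make the Wengert-list idea concrete, here's a toy reverse-mode tape in plain Python (my own sketch; source-to-source tools effectively generate the equivalent of this automatically):

```python
tape = []   # Wengert list: (output_id, [(input_id, local_partial), ...])
vals = []   # primal values, indexed by id

def var(x):
    vals.append(x)
    tape.append((len(vals) - 1, []))
    return len(vals) - 1

def add(a, b):
    vals.append(vals[a] + vals[b])
    tape.append((len(vals) - 1, [(a, 1.0), (b, 1.0)]))
    return len(vals) - 1

def mul(a, b):
    vals.append(vals[a] * vals[b])
    tape.append((len(vals) - 1, [(a, vals[b]), (b, vals[a])]))
    return len(vals) - 1

def grad(output_id):
    adj = [0.0] * len(vals)
    adj[output_id] = 1.0
    for i, inputs in reversed(tape):   # replay the list backwards
        for j, d in inputs:
            adj[j] += adj[i] * d
    return adj

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x.
x, y = var(3.0), var(4.0)
f = add(mul(x, y), x)
print(grad(f)[x], grad(f)[y])   # 5.0, 3.0
```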
You can still do that, most of the time. But if your function has many conditionals, the final expression may get cumbersome.
Sometimes, the symbolic derivatives will be faster to compute, but most often they will be slower than using dual numbers for example. It is just a different tool to solve the problem. Most often, it is easier to use: you do not need a computer algebra system, you just change the type of your numbers and call it a day.
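Here's what "change the type of your numbers" looks like as a toy sketch (forward-mode AD with dual numbers; a real implementation would cover more operators):

```python
class Dual:
    # a + b*eps with eps^2 = 0; b carries the derivative along.
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def f(x):
    # An ordinary function; it never heard of derivatives.
    return 3 * x * x + 2 * x + 1

y = f(Dual(2.0, 1.0))    # seed with dx/dx = 1
print(y.a, y.b)          # f(2) = 17, f'(2) = 14
```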
CASes generate symbolic derivative expressions. This is typically a fairly expensive operation, and can be slow if the number of expressions is large or if the expressions are complicated. You have to pre-generate these expressions before you can do any work (paying an upfront cost). CASes also use a very general symbolic engine, which entails a performance penalty.
If you're doing numerical work and don't need explicit derivative expressions, automatic differentiation offers a method that is both fast and convenient.
> Why not use a Computer Algebra System instead for generating derivatives of given functions?
You will generate expressions, and evaluating those expressions leads to the execution of exactly the same elementary operations as with forward mode AD. But with the overhead of needing the Computer Algebra System, and storing the expressions. (Ok, in some cases the CAS is maybe able to simplify the expressions a bit.)
When possible, one should link to the abstract page, if for no other reason than that it makes version tracking easier: Want, Wu, Essertel, Decker, and Rompf - Demystifying differentiable programming: shift/reset the penultimate backpropagator (https://arxiv.org/abs/1803.10228), a title which is entirely too cute for its own good.
Thanks Jade, didn’t realize I grabbed the wrong link.
This paragraph from the paper was helpful for me:
Differentiable programming is of joint interest to the machine learning and programming language communities. As deep learning models become more and more sophisticated, researchers have noticed that building blocks into a large neural network model is similar to using functions, and that some powerful neural network patterns are analogous to higher-order functions in functional programming [Fong et al. 2017; Olah 2015]. This is also thanks to the development of modern deep learning frameworks which make defining neural networks “very much like a regular program” [Abadi et al. 2017; LeCun 2018].
Be cautious using this article to try to learn anything. Differentiable programming is not actually related to deep learning; it's another word for automatic differentiation, a technique that is very important in deep learning implementations but useful for a variety of other tasks where having gradients available for arbitrary functions is useful.
The article is correct that "Differentiable Programming" seems to be a rebranding effort that I believe just helped automatic differentiation work from the machine learning world get published in Programming Languages journals. I wouldn't read too much into it.
The only serious difference between classical AD and newer efforts like this is that they have an intermediate language ("what goes on the tape") with recursion and control flow operators which ideally lets them implement front-ends for normal programming languages.
You can achieve the same effect with classical AD by doing what PyTorch does in its default mode where the dynamic recursion and control flow is normal Python code and large tensor ops are asynchronously scheduled on the target device and an AD tape for that particular dynamic run is constructed each time. But the viability of PyTorch's approach is domain dependent: there are relatively few but large tensor ops, and while host-device synchronization is possible (for making the dynamic control flow dependent on intermediate device-computed results) it's also very costly. Although the first of these would be slightly less critical if the host language executed faster.
But, if you're not trying to achieve the kind of decoupling between host and device which PyTorch and similar frameworks want, then I don't see any particular reason you can't just use normal forward-mode or reverse-mode AD as an alternate scalar execution mode for existing code, which just needs a different interpreter/library, not a different intermediate language. For the purposes of differentiation, the branching and control flow operators don't exist, anyway. E.g. if you represent branching explicitly in your graph/tape as TensorFlow does, the subgradient of if(a, b, c) is just the subgradient of whichever of b or c is selected. The conditional variable a only serves as a selector; if you compute the subgradient with respect to the selector, the answer is always 0 (except for the knife-edge case where it's technically not defined but is treated as 0).
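A quick PyTorch check of that claim (a sketch; torch.where plays the role of the explicit if(a, b, c) graph op):

```python
import torch

a = torch.tensor(1.5, requires_grad=True)   # selector, used via a > 0
b = torch.tensor(2.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)

y = torch.where(a > 0, b * b, c * c)   # branching as an explicit graph op
y.backward()
print(b.grad, c.grad)   # tensor(4.), tensor(0.): only the selected branch
# a is only used through the boolean a > 0, so no gradient reaches it
# (a.grad stays None), matching the "selector gradient is 0" point.
```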
I think it's just the realization that the execution graphs in machine learning models at this point are not really different from any programming language AST, which means there is potential in exploring the intersection between writing programs and writing machine learning models with the AD tools.
It's ok, it seems at this point the focus is on creating the tools to better allow that exploration, including making the entirety of a programming language valid syntax for building any model (supporting the AD). There are the efforts in Julia's Zygote and Swift for TensorFlow that I know of.
I think the differentiable Forth example in the article is interesting in this context, since it has a differentiable program with gaps, and it uses the universal approximation property of a neural network to fill them. When your code is differentiable, it's possible to embed ML models, perhaps to learn a part of an equation when you already know most of it and which would otherwise have too large a search space. You might even have, like another commenter said, a compiler smart enough to rewrite the AST to reduce convergence problems (which seem to be the main problem with such models). Or you could download libraries with pretrained models/architectures the same way you download any regular program library, to embed in deeper systems.
Though I honestly can't tell if any of those are actually valid pursuits or I'm misunderstanding the possibilities.