I have unfortunately not been immune to the recent wave of interest in generative text-to-image AI models (see the hype over DALL-E 2 and Stable Diffusion), and have spent maybe a little too much time indulging in unfulfilled childhood dreams of being the next Picasso. While DALL-E 2, Stable Diffusion and other large models are extremely impressive, the idea I would like to share here is the use of basic geometric building blocks on a canvas to generate abstract art in the vein of pointillism and geometric art. When combined with a pretrained image-text encoder (CLIP), we can generate images by optimising directly through the encodings and the rendering tool, with no training necessary. This “CLIP guidance” is similar to VQGAN-CLIP, DiffusionCLIP, and many other works from the community at EleutherAI. What is different in our case is that instead of a GAN, or any pre-trained generative model, we use simple differentiable rendering software. Our idea is also heavily inspired by CLIP Draw, but here we build on their idea with more basic geometric blocks instead of vector strokes, focusing on more abstract generation.
Furthermore, to see how a more “modern” model reinterprets our produced art pieces, we will use a fine-tuned Stable Diffusion model to “dream” (really it’s just interpolation, but that sounds less cool), using our images as conditional embeddings.
Our architecture hinges on two key components. The first is CLIP, a joint image-text encoder that gives us a way to compare candidate images against the prompt text: we embed both and score them using cosine distance.
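As a minimal sketch (assuming the image and the prompt have already been embedded by CLIP into fixed-length vectors; the toy vectors below are stand-ins, not real CLIP outputs), the score is just the cosine of the angle between the two embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for CLIP embeddings of a candidate image and the prompt text.
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.25, 0.85, 0.15]

# The optimiser minimises this distance (equivalently, maximises similarity).
cosine_distance = 1.0 - cosine_similarity(image_embedding, text_embedding)
```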
The second is diffvg, a differentiable vector graphics renderer that we can optimise through. Specifically, since our renderer is differentiable, we can use gradient descent for fast optimisation.
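To make “optimise through the renderer” concrete, here is a self-contained toy (not diffvg itself): a soft Gaussian “dot” rendered onto a small grid, with its centre fitted to a target render by analytic gradient descent. diffvg plays the same role at scale, supplying gradients of the rendered pixels with respect to the shape parameters.

```python
import math

SIZE, SIGMA = 32, 6.0

def render_dot(cx, cy):
    """Render one soft circular dot (Gaussian falloff) onto a SIZE x SIZE grid."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * SIGMA ** 2))
             for x in range(SIZE)] for y in range(SIZE)]

def loss_and_grad(cx, cy, target):
    """Squared-pixel loss against a target render, plus analytic d(loss)/d(centre)."""
    img = render_dot(cx, cy)
    loss, gx, gy = 0.0, 0.0, 0.0
    for y in range(SIZE):
        for x in range(SIZE):
            diff = img[y][x] - target[y][x]
            loss += diff ** 2
            # d(pixel)/d(cx) = pixel * (x - cx) / sigma^2  (chain rule on the Gaussian)
            gx += 2 * diff * img[y][x] * (x - cx) / SIGMA ** 2
            gy += 2 * diff * img[y][x] * (y - cy) / SIGMA ** 2
    return loss, gx, gy

target = render_dot(20.0, 22.0)      # the "prompt": a dot at (20, 22)
cx, cy = 12.0, 12.0                  # initial guess
initial_loss, _, _ = loss_and_grad(cx, cy, target)
for _ in range(100):                 # plain gradient descent on the centre
    _, gx, gy = loss_and_grad(cx, cy, target)
    cx, cy = cx - 0.3 * gx, cy - 0.3 * gy
final_loss, _, _ = loss_and_grad(cx, cy, target)
```

In the real pipeline the loss is the CLIP cosine distance to the prompt rather than a pixel difference, but the mechanics are identical: gradients flow from the score, through the rendered pixels, back to shape positions and colors.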
One important caveat, as noted in VQGAN-CLIP, is the need for image augmentation: the gradients are often noisy estimates, and averaging over augmentations produces more coherent and understandable images.
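The rationale can be illustrated with a toy statistics sketch (plain random numbers standing in for per-augmentation CLIP scores): averaging the score over many augmented views shrinks the variance of the noisy estimate, which is what stabilises the gradients in practice.

```python
import random
import statistics

random.seed(0)

def noisy_score():
    """Stand-in for the CLIP score of one augmented view: true value 0, noise sigma=1."""
    return random.gauss(0.0, 1.0)

# Score from a single view vs. the average over 32 augmented views.
single_view = [noisy_score() for _ in range(1000)]
averaged = [statistics.fmean(noisy_score() for _ in range(32)) for _ in range(1000)]

var_single = statistics.pvariance(single_view)
var_averaged = statistics.pvariance(averaged)  # roughly var_single / 32
```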
With that, our renderer can generate a variety of geometric shapes for our images. Here, I focus on two types, pointillism using dots (circles) and geometric art using triangles, followed by a final comparison with the stroke-based model used in CLIP Draw.
We can attempt to simulate the pointillist technique by using fixed-size vector circles, then optimising over the position and color of these circles. In Fig. 2, we use 10,000 such circles. We can see the process generating smaller details, such as, in the “painting of an old man and the sea”, the hand of the old man gripping a walking stick. However, there is still significant noise from residual circles that do not model specific semantic features (note the sparkling effect).
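A minimal sketch of the parameterisation (plain Python standing in for the diffvg tensors): each dot keeps a fixed radius, while its centre and RGB color are the free variables the optimiser updates, with colors clamped back to a valid range after each step.

```python
import random

random.seed(0)
CANVAS, RADIUS, N_DOTS = 224, 2.0, 100  # the post uses 10,000 dots; fewer here

# Free parameters per dot: centre (x, y) and color (r, g, b); the radius stays fixed.
dots = [{"center": [random.uniform(0, CANVAS), random.uniform(0, CANVAS)],
         "color": [random.random(), random.random(), random.random()]}
        for _ in range(N_DOTS)]

def apply_gradients(dots, grads, lr=1.0):
    """One gradient-descent step over centres and colors."""
    for dot, g in zip(dots, grads):
        dot["center"] = [p - lr * gp for p, gp in zip(dot["center"], g["center"])]
        dot["color"] = [min(1.0, max(0.0, c - lr * gc))  # keep colors in [0, 1]
                        for c, gc in zip(dot["color"], g["color"])]

# Example step with dummy gradients (in practice these flow back through diffvg + CLIP).
grads = [{"center": [0.1, -0.1], "color": [0.05, -2.0, 0.0]} for _ in dots]
apply_gradients(dots, grads)
```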
Using triangles instead produces a more abstract interpretation of the prompts; since triangles have less expressive power, they are less able to model finer details. We use 100 triangles and similarly optimise over their color and position. Interestingly, triangles were also used in another project ([here](https://es-clip.github.io/)), where an evolution strategy (ES) algorithm was used instead. Another extension of our work would be to investigate the effect of the optimisation algorithm beyond the gradient-based ones we used, although ES algorithms notably do not scale well, and hence such algorithms will be limited to simpler drawings.
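For context, here is a minimal sketch of the kind of evolution-strategy update involved (a generic antithetic ES on a 1-D toy objective, not the actual ES-CLIP algorithm): the gradient is estimated purely from fitness evaluations, so no differentiable renderer is needed, at the cost of far more evaluations as the parameter count grows.

```python
import random

random.seed(0)

def fitness(x):
    """Toy objective standing in for the CLIP score: peaks at x = 3."""
    return -(x - 3.0) ** 2

x, sigma, lr, n_pairs = 0.0, 0.5, 0.05, 50
for _ in range(200):
    grad_est = 0.0
    for _ in range(n_pairs):
        eps = random.gauss(0.0, 1.0)
        # Antithetic sampling: evaluate fitness at mirrored perturbations.
        grad_est += (fitness(x + sigma * eps) - fitness(x - sigma * eps)) * eps
    grad_est /= 2 * n_pairs * sigma
    x += lr * grad_est  # ascend the estimated gradient
```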
Finally, we illustrate some of the images produced by CLIP Draw, which uses strokes directly, producing a more cartoonish effect.
Stable Diffusion’s Interpretation
Having played around with the fantastic open-sourced Stable Diffusion model, one of its strengths is how well it replicates the “style” of characteristic artwork. Here, we can test Stable Diffusion on our produced art by feeding it our images. Unfortunately, Stable Diffusion is itself a text-to-image model, so images cannot be provided as input directly. While the provided img2img script allows us to input images, it still requires a prompt and does not preserve the semantic meaning of the image. We thus utilise a Stable Diffusion model fine-tuned on CLIP image embeddings instead of the usual CLIP text embeddings.
img2img and Image Variation
As Stable Diffusion (at the time of writing) is only a few months old and I have not seen a clear explanation of this available, I will go into further detail on the differences between img2img and the fine-tuned image variation model. In the interest of brevity, I will presume knowledge of Stable Diffusion and (latent) diffusion models.
The img2img script encodes the input image into latent space, adds some noise, then uses this noisy latent as the starting point for the (reverse) diffusion process in our model, instead of starting from pure random noise. The image variation model, on the other hand, uses the CLIP embedding of the input image as the conditioning, in place of the usual text prompt embedding. However, this requires further fine-tuning of the current Stable Diffusion model.
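The contrast between the two pipelines can be summarised in pseudocode (the function names here are schematic stand-ins, not a real API):

```
# img2img: structure comes from the input image, semantics from the text prompt.
z        = vae_encode(input_image)       # image -> latent
z_noisy  = add_noise(z, strength)        # partial forward diffusion
output   = vae_decode(denoise(z_noisy, cond=text_embed(prompt)))

# Image variation: start from pure noise, semantics come from the image itself.
z_noise  = sample_gaussian_noise()
output   = vae_decode(denoise(z_noise, cond=clip_image_embed(input_image)))
```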
The implication of all this is that while img2img preserves the structure and position of the input image, it cannot capture its semantic meaning the way the image variation model can. The image variation model, by simply conditioning on the input image embedding, generates output images that may vary in structure but retain the same semantic meaning, which is exactly what we would like in this case, i.e., to preserve the “technique” we used in generating our images. If what was desired was to improve on a rough sketch, then img2img would be more suitable.
Dreams of Stable Diffusion
We can further interpolate between the random noise latents used to initialise the (reverse) diffusion process to produce a video of Stable Diffusion “dreaming”. This is inspired by Andrej Karpathy’s script, although modified for the fine-tuned Stable Diffusion model.
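A common way to do this interpolation (and, as I understand it, what Karpathy’s script does) is spherical linear interpolation between two Gaussian noise latents, since high-dimensional Gaussian samples concentrate near a sphere and plain linear blends would pass through unusually low-norm latents:

```python
import math

def slerp(t, v0, v1):
    """Spherical linear interpolation between two vectors, t in [0, 1]."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm = math.sqrt(sum(a * a for a in v0)) * math.sqrt(sum(b * b for b in v1))
    theta = math.acos(max(-1.0, min(1.0, dot / norm)))  # angle between v0 and v1
    if theta < 1e-6:                                    # nearly parallel: lerp is fine
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Each frame of the "dream" starts reverse diffusion from one interpolated latent.
frames = [slerp(i / 9, [1.0, 0.0], [0.0, 1.0]) for i in range(10)]
```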
We can see Stable Diffusion successfully preserving the semantic meaning and the style of the images we produced.