DALL-E 3 vs. Midjourney: Early Explorations
DALL-E 3 is impressive and will likely become my go-to image generation model. Midjourney, though, still has the edge in photorealism.
With image generation finally rolling out in ChatGPT, I couldn't wait to get my hands on DALL-E 3, which is now available everywhere - even for us privacy-loving Europeans 😅.
I've used Midjourney extensively in the past and have a relatively good grasp of its limitations, so I was excited to see how DALL-E 3 compares.
The key takeaway: my sense is that DALL-E 3 performs substantially better than Midjourney. It actually creates the things you want it to, which often isn't the case with Midjourney. Let me briefly demonstrate this with three examples.
DALL-E 3 has better text rendering
Anyone who has ever attempted to render text on a generated image using Midjourney has likely been deeply disappointed. Take, for instance, the following prompt:
Can you generate an image for a T-Shirt that says "I am not a lawyer, but a scientist"
This is what Midjourney produces:
There are a couple of things to unpack here. First, I don't see any T-shirts. Second, these are presumably designs that could go on a shirt, but I didn't exactly ask for a rendering of Tom Selleck in the mid-19th century.
Most importantly, the text is a complete mess.
You could argue that this is a minor detail, but given that so much illustrative work includes text, this shortcoming has made Midjourney quite limited for many use cases.
Now, let's give the same prompt to DALL-E 3 via ChatGPT Plus:
This is a substantial improvement. Most images still contain typos, but there is at least one perfect rendering of exactly what I requested.
Just say what you want
A second key advantage of DALL-E 3 is its user-friendly interface, which is already familiar to many: ChatGPT. You can simply request an image generation within the ChatGPT text box. My only gripe is that you have to specify image generation right at the beginning of the chat; it won't accommodate such requests if made in the middle of a general conversation.
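For those who'd rather script this than chat, here's a minimal sketch of what programmatic access could look like with OpenAI's Python SDK. This assumes DALL-E 3 is reachable through the Images API under the model name "dall-e-3" and that an API key is set in your environment; the ChatGPT interface remains the simplest way in.

```python
# Minimal sketch of programmatic access to DALL-E 3 via OpenAI's Python SDK.
# Assumes the model is exposed through the Images API as "dall-e-3" and that
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.images.generate(
    model="dall-e-3",
    prompt='An image for a T-shirt that says "I am not a lawyer, but a scientist"',
    size="1024x1024",
    n=1,  # one image per request
)

print(response.data[0].url)  # link to the generated image
```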
Once you've generated an image, DALL-E 3 allows for iterative improvements. For instance, you can request that the T-shirt be worn by a person.
You can see the limitations of DALL-E 3 in this image. It's not able to take the exact same T-shirt from the previous image and render it on a person. This is likely due to the way the images are generated: each one is created from scratch rather than by editing the previous output, so there is no persistent notion of "the same shirt". It may take some time to overcome this hurdle. The second limitation is that the rendering isn't as photorealistic as I'd like. This could be by design - OpenAI may have concerns about the potential misuse of DALL-E 3. However, my general assumption is that it's a limitation of the current model.
In any case, DALL-E 3 quickly generated an image that was exactly what I was looking for, something Midjourney couldn't accomplish.
DALL-E 3 has a better literal understanding of what I want
Consider this prompt:
I need an image of a broken light bulb, dimly lit, in front of the background of the Swiss Alps. It should be as photo-realistic as possible.
Here's what Midjourney generates:
I've tried numerous iterations of this prompt, and I simply can't get Midjourney to render a shattered light bulb.
Using the same prompt with DALL-E 3 yields the following:
Bingo.
I'd be extremely curious to understand what aspects of these two models make them either fail or succeed at this task. For now, it seems to me that DALL-E 3 has a more literal interpretation of my requests. I suspect that images of broken light bulbs are far rarer than images of intact ones, so an image generator's propensity to produce a whole light bulb could simply reflect data bias. Nevertheless, this didn't prevent DALL-E 3 from accurately rendering a broken bulb.
This is just one example, but it highlights a recurring strength of DALL-E 3: it is far better at translating my literal prompts into images. Midjourney, by contrast, seems to latch onto certain keywords and combines them into an image that, while possibly captivating, often lacks crucial details.
Once again, I find the photorealism in DALL-E 3 to be somewhat wanting.
The DALL-E 3 interface is superior
Midjourney employs Discord, which has a poor interface:
It's cluttered with non-intuitive icons and endless channels. I'm also continually exposed to what other users are generating, which means they can see my creations as well - not ideal.
Once an image is produced, a menu appears that seems deliberately designed for maximum confusion:
In contrast, the DALL-E 3 interface is straightforward: it's simply the ChatGPT interface, no fuss. You can click on an image to view the following:
Interestingly, ChatGPT took my initial prompt and transformed it into a DALL-E 3-specific prompt, which I can then copy and further experiment with. In fact, each of the four generated images comes with a different prompt.
Leveraging the language model to create better prompts than the user originally typed is quite powerful.
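If you're curious what this looks like outside the chat window: in OpenAI's Python SDK, each generated image comes back with a revised_prompt field carrying the expanded prompt. A sketch, assuming the API mirrors the behavior I saw in the ChatGPT interface:

```python
# Sketch: inspecting the rewritten prompt that DALL-E 3 actually received.
# Assumes the Images API attaches the model-rewritten prompt to each result
# as `revised_prompt`, mirroring what ChatGPT shows in its interface.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A broken light bulb, dimly lit, in front of the Swiss Alps, photo-realistic",
)

image = response.data[0]
print("Image URL:      ", image.url)
print("Revised prompt: ", image.revised_prompt)  # the expanded prompt, not my original
```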
In summary, my early explorations indicate that OpenAI's image generation model outperforms the competition in much the same way its language model does: it simply seems to have a better understanding of what the user wants.
I'm eager to see how this space evolves. I anticipate an explosion of experimentation now that DALL-E 3 is widely accessible.
CODA
This is a newsletter with two subscription types. You can learn why here.
To stay in touch, here are other ways to find me:
Social: I’m mainly on Mastodon, and occasionally on LinkedIn.
Writing: I write another Substack on digital developments in health, called Digital Epidemiology.