I asked three different commercially available LLMs the same question:
Which TLDs have the same name as valid HTML5 elements?
This is a pretty simple question to answer. Take two lists and compare them. I know this question is possible to answer because I went through the lists two years ago. Answering the question was a little tedious and relied on my tired human eyes making no mistakes. So surely this is the sort of thing which an LLM excels at, right?
Wrong!
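For reference, the whole comparison fits in a few lines of Python. This is only a sketch, not how I originally did it: it assumes you've downloaded IANA's TLD list from https://data.iana.org/TLD/tlds-alpha-by-domain.txt and saved it locally as tlds-alpha-by-domain.txt, and it hard-codes the element names from the HTML Living Standard.

```python
# Rough sketch: intersect HTML element names with the IANA TLD list.
# Assumes https://data.iana.org/TLD/tlds-alpha-by-domain.txt has been
# saved locally as "tlds-alpha-by-domain.txt".

# Element names taken from the WHATWG HTML Living Standard
# (hard-coded here; may not be exhaustive).
HTML_ELEMENTS = {
    "a", "abbr", "address", "area", "article", "aside", "audio", "b",
    "base", "bdi", "bdo", "blockquote", "body", "br", "button", "canvas",
    "caption", "cite", "code", "col", "colgroup", "data", "datalist",
    "dd", "del", "details", "dfn", "dialog", "div", "dl", "dt", "em",
    "embed", "fieldset", "figcaption", "figure", "footer", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "head", "header", "hgroup",
    "hr", "html", "i", "iframe", "img", "input", "ins", "kbd", "label",
    "legend", "li", "link", "main", "map", "mark", "menu", "meta",
    "meter", "nav", "noscript", "object", "ol", "optgroup", "option",
    "output", "p", "picture", "pre", "progress", "q", "rp", "rt",
    "ruby", "s", "samp", "script", "search", "section", "select",
    "slot", "small", "source", "span", "strong", "style", "sub",
    "summary", "sup", "table", "tbody", "td", "template", "textarea",
    "tfoot", "th", "thead", "time", "title", "tr", "track", "u", "ul",
    "var", "video", "wbr",
}

with open("tlds-alpha-by-domain.txt") as f:
    # The IANA file starts with a "#" comment line; TLDs are upper case.
    tlds = {line.strip().lower() for line in f if not line.startswith("#")}

print(sorted(HTML_ELEMENTS & tlds))
```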
Here's how the three big beasts fared.
ChatGPT
Good question! Let’s carefully check this.
[…Skip preamble…]
✅ Confirmed Matches
The following TLDs exactly match valid HTML5 element names:
- .audio → <audio>
- .code → <code>
- .link → <link>
- .menu → <menu>
- .nav → <nav>
- .video → <video>
This is wrong on two counts. First, it misses off several matches. Secondly, there is no .code TLD! There is a .codes, but that's not a match.
Google Gemini
Gemini shits the bed completely:
Here are the top-level domains (TLDs) that have the same name as valid HTML5 elements:
- .address: The HTML <address> element is used to define the contact information for the author/owner of a document or an article.
- .article: The HTML <article> element specifies independent, self-contained content.
- .aside: The HTML <aside> element is used to define some content aside from the content it is placed in.
- .audio: The HTML <audio> element is used to embed sound content in a document.
- .body: The HTML <body> element defines the document's main content.
- […Skip a dozen more items…]
It just lists a bunch of HTML elements with no crossover to TLDs. Thanks Google. Thoogle.
Claude
Anthropic's AI does a half-arsed job.
Here are the TLDs that match HTML5 element names:
Some other potential matches depend on how broadly you interpret "HTML5 elements":
- .audio - matches the <audio> element
- .video - matches the <video> element
- .data - matches the <data> element
- .link - matches the <link> element
- .menu - matches the <menu> element
- .style - matches the <style> element
- .select - matches the <select> element
- .app - while there's no <app> element in HTML5, there was discussion of it during HTML5 development
- .art - not a standard HTML element, though some frameworks use custom elements like this
The first seven are correct - although it ignores a dozen more correct answers. But then it does the LLM thing of trying to over-fit the data in a simulacrum of usefulness.
Don't believe the hype
To be clear, I would expect a moderately intelligent teenager to be able to find two lists and compare them. If an intern gave me the same attention to detail as above, we'd be having a cosy little chat about their attitude to work.
I've discussed before why people have such dramatically different experiences using AI. For some it is an all-knowing and ineffable source of wisdom. For others, like me, it is mostly garbage.
I think it comes down to how familiar you are with the domain and its constraints. When I watch a medical drama, I have no idea if they're using realistic language. It sounds good, but real doctors probably cringe at the inaccuracies.
AI seems plausible because it relies on the Barnum Effect - it tells people what they want to hear. It produces words which sound correct.
We need a new term for Gell-Mann Amnesia. Something that describes how an AI is convincing if you don't understand its reasoning, and close to useless if you understand its limitations.
20 thoughts on “LLMs are still surprisingly bad at some simple tasks”
To stave off some obvious comments:
yoUr'E PRoMPTiNg IT WRoNg!
Am I though? I don't think I've asked for anything particularly ambiguous. This task is a matter of fact, not opinion. If it can't cope with a fairly limited task with an easily testable end-state, why should I trust it with anything more complicated?
How much prompting would I have to do in order to get the correct answer? Would it be more or less effort than manually doing the work?
WHY Do YOu HaTE PROGREsS?
I don't. I hate the way this is being sold as a universal and magical tool. The reality doesn't live up to the hype.
Every time someone says they've made progress in AI, I try a few simple tasks that I'd expect an intern to be able to manage. I've yet to be impressed by anything these chatbots have regurgitated.
nExt yEAR lLMS WILl be GrEat
I've been hearing that for a while.
I can't wait until I can jack into the Metaverse and buy an NFT with cryptocurrency just by using an LLM! Perhaps I can view it on my 3D TV by streaming it over WiMAX? I'd better stock up on quantum computers to make sure it all works.
you WeRE PAId BY biG OIl TO sAY tHIs
(This is genuinely a comment I received on one of my solar posts!)
Over the years I've been accused of being in the pocket of Apple, Google, BlackBerry, Nokia, Tesla, Amazon, the fossil fuel lobby, the nuclear lobby, Chinese solar panel manufacturers, and a whole bunch more.
None of them have paid me yet. Any day now, I'm sure.
Carl Breen
You were prompting it wrong, though. I did it in 3 steps. I noticed the model resisted returning all the TLDs, either linking me to the list or telling me it was too long. After I told it I didn't care, it returned them. I named each list Table 1, 2, and 3 and told it how to generate the third one.
It worked. The problem with LLMs is that they want to be smarter than the user, or they refuse the work because they think they know better.
Table 3 (Element): audio, br, hr, li, link, style, td, th, tr, video
Your "Table 3" is missing .data, .map, .menu, and others.
@blog I agree with you that we need the point made by Gell-Mann Amnesia applied to LLMs. Why do we need a new term for it?
I think this illustrates one of the most unexpected limitations of LLMs - they're really bad at comprehensive, methodical problem solving across more than a few dozen items. Anything that involves methodically processing every item in a known long list frequently produces useless results.
Lau Gainpaulsingh
I have been a bit meta and asked ChatGPT about your opinion of ChatGPT and here is the result:
https://chatgpt.com/share/68cffaf6-9a70-8004-a452-e400bb11a548
I personally wouldn't go as far as saying its limitations make it useless. On the contrary, I have got ChatGPT to refactor code for me and it has been phenomenal. I know it has made mistakes, but as an experienced software engineer, I know how to tweak things to get what I want, and I end up with very fast and efficient code. Of course, it has to go through a testing process (especially because of the limitations), but you can certainly get good results.
Your argument is "it's good at doing stuff but it makes a lot of mistakes I have to correct"?
That doesn't sound like it is phenomenal. How do you know which mistakes you haven't caught?
Lau Gainpaulsingh
In the same way that a thorough testing process will capture the mistakes that a human hasn't caught. That is the point of test teams. A testing process should capture the "trivial" mistakes that an LLM will make, just as it will capture the more complex ones that humans (and AI) make. However, the refactoring gains have outweighed what I could possibly do manually. I am aware of the edge cases and test scenarios I need to cover. Naturally I will only do this on code that isn't a huge project requiring hundreds of developers, but in large-scale projects, you will also have large test teams.
Maz Atighi
Your observation that they’re terrible at some things is exactly right, and you’re not prompting it wrong, but I do think you’re using it wrong.
The question that correlates with usefulness is “Am I asking about something likely to be well documented in some textual medium?” If yes, you’ll likely get good results. If no, garbage.
Here are my last few ChatGPT queries, all of which gave me excellent answers:
- How can I keep my charger from falling out of an airplane power jack?
- What are the various ways of representing sports odds and what do they mean?
- How are Indonesia, Malaysia, and Singapore related?
These days, these answers come with links to sources and they’re all quite accurate and much faster than sorting through search results.
There are more “think-y” queries that can be hit or miss, but asking about something novel, like your question in the post, is likely to be a bust.
The answer is well-documented though. The answer has been on my blog for a couple of years, and has been slurped up by various aggregators.
Even if it hadn't ingested my work - both those lists are well-referenced.
Mike
I don’t think you’ve understood, Terence. You are at fault for asking the spicy autocomplete to do something it is unable to, which is incorrect usage. If you take the time to properly understand the myriad limitations and how to avoid them you will see how brilliant the technology is. Your next job could be in Prompt Engineering, you just don’t know it yet.
@Edent At the risk of claiming "You're holding it wrong", the model makes a massive difference. I gave the same question to the Google Gemini 2.5 Pro "Thinking" (lulz) model. Here's the result.
gTLD: audio, date, form, link, map, menu, nav, select, style, video
ccTLD: hr, td, th, tr
I exported the result as a doc, which doesn't show the "Thinking" (lulz), or the Sources it quotes, so I pasted that in, in a different font. 🤷
https://docs.google.com/document/d/1WKhyjsD8sXx13JTRl1DveVqGxUE2fjN5Iig2lEbBU_s/edit?usp=sharing
OK. But it missed out "br" - there's a <br> element and a .br for Brazil. It's also missing "search".
It has made up the .nav domain, which doesn't exist.
So perhaps both of us are holding it wrong 🙃
Maz Atighi
An argument in blog comments is giving me serious 2005 nostalgia. This is offered in the spirit of discussion, rather than argument, but something on one blog (or a dozen blogs) is not well-documented, nor would we want it to be considered so, because then LLM output would really be nothing but garbage. The scale of the documentation has to be far greater, or from very authoritative sources.
I've been hosting arguments here for many years - and hope to do so far into the future.
Considering that most of the code generation appears to have been trained on half-arsed StackOverflow answers, I think it is fair to say that the output is closer to garbage than many would admit.
As I say, both lists are very well documented and from highly authoritative sources.
MichalZ
Tried https://www.perplexity.ai/search/which-tlds-have-the-same-name-42yGbTY2SoSChRcBmqfdCg#0 with Research mode
So it ignores 4 generic TLDs and all the ccTLDs!
📖
More comments on Mastodon.
Trackbacks and Pingbacks
[…] sometimes things that are within our own domain slip past us, so how can we know whether an answer is right or wrong in a business area we don't know properly? Some answers can be […]