Saturday Links: PDDL and Symbolic Planning, GDPval, and Grabbing Context

Planning in LLMs, efficient attention mechanisms and breakthroughs in time series models.


This week, I was at the excellent APIDays "No AI with no APIs" event in London. Thank you so much to the team for the very kind invitation. A link to my talk slides is here; a longer write-up is coming soon. In a busy week elsewhere, Exa released an MCP server, Salesforce's MuleSoft acquisition paid off even more with an entry into the agent frameworks market, and the TikTok algorithm looks set to be managed by Oracle.

On to the main eye-catching bits of news this week, with a scientific/technical lean:

  • Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning. In this paper on LLM reasoning, a team at Stanford used training data representing planning tasks and their solutions to instruction-tune an LLM and improve its logical planning ability. The representation of planning problems and solutions they used is a real blast from the past: PDDL (the Planning Domain Definition Language), created in 1998. You can find the formal technical report here. The results clearly show that a formal representation can help an LLM handle complex logical challenges more consistently. My guess, though, is that getting very high accuracy on planning tasks will require adding an actual logical reasoning engine as a tool the LLM can call. To make the format concrete, a minimal PDDL sketch follows this list.
  • GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. In this paper, OpenAI researchers carried out an extensive evaluation of LLM performance on tasks directly relevant to jobs held by humans in key industries today. The report starts from the top 10 US industries by GDP contribution, identifies the key roles within each, and specifies the key tasks carried out by experts in those roles. One of the most eye-opening parts of the paper is the final annex, which lists the industries and roles along with estimated total remuneration for each role. Across this (very approximately) $3-$4 trillion of annual remuneration, leading LLMs performed close to human-expert-level competence on many tasks. With human oversight, many of the tasks can be performed well (though in unsupervised mode there is also a higher risk of critical failures). The results are very impressive, and models are still improving. Not only that, this is raw model performance: in many industries, a multitude of startups are building scaffolding and support for these tasks that should improve outcomes further. The team also open-sourced a set of 220 golden task examples. With results like this, it is hard to argue that AI solutions will not quickly eat into human labour budgets (and not just software budgets).
  • Analog in-memory computing attention mechanisms for fast, energy-efficient large-language models. One of the challenges with the core attention mechanism in transformer LLMs is that it requires constantly moving the key-value cache in and out of GPU memory. In this paper, published in Nature, the authors describe a mechanism that makes it possible to prime attention memory and update it incrementally. In their experiments on an initial (GPT-2 class) model, the method shows between a two and four order-of-magnitude reduction in latency and potentially in energy usage. The results still need to be validated and tried at scale, but if realized, these techniques could lead to significant efficiency gains. A short sketch of the data-movement bottleneck being targeted also follows this list.
  • Introducing ChatGPT Pulse. Otherwise known as the next shot in the war for personal context. ChatGPT now has a new mobile feature that pops up suggestions and curated information for your activities that day. It can also connect to your calendar, which may be the real goal here. The service sounds useful (though I'd argue you might be better off not killing your morning vibe with more pop-ups). It poses a significant threat to Google and Apple, since it takes over another slice of personal information, context and screen time. A push to integrate your email and DMs might not be far down the line.
  • Time series foundation models can be few-shot learners. Rounding off the week with another scientific post. This week, Google released a model that is strong at predicting continuations of time series, an essential function in modern business. Previous work had already made a breakthrough by producing foundation models that need no domain-specific adaptation to forecast credibly. In time series tasks, however, accuracy is king, and in this new work the Google team shows that injecting just a few in-context examples can significantly improve a model's performance. A rough sketch of that few-shot setup is included below.
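
To make the PDDL representation concrete, here is a hand-written toy example (my own blocks-world sketch, not one of the paper's tasks), wrapped in Python strings so it runs as-is. A planning task is a domain (the available actions with their preconditions and effects) plus a problem (objects, initial state, goal), and the "solution" is a symbolic plan of the kind the instruction-tuned model is asked to emit.

```python
# A minimal, hand-written PDDL example (my own toy blocks-world, not taken from
# the paper), wrapped in Python strings. Domain = available actions; problem =
# objects, initial state, and goal; the plan is the symbolic "answer".

DOMAIN = """
(define (domain blocks)
  (:requirements :strips)
  (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))
  (:action pick-up
    :parameters (?x)
    :precondition (and (clear ?x) (ontable ?x) (handempty))
    :effect (and (not (ontable ?x)) (not (clear ?x)) (not (handempty)) (holding ?x)))
  (:action stack
    :parameters (?x ?y)
    :precondition (and (holding ?x) (clear ?y))
    :effect (and (not (holding ?x)) (not (clear ?y)) (clear ?x) (handempty) (on ?x ?y))))
"""

PROBLEM = """
(define (problem stack-a-on-b)
  (:domain blocks)
  (:objects a b)
  (:init (ontable a) (ontable b) (clear a) (clear b) (handempty))
  (:goal (on a b)))
"""

# A valid plan for this problem, in the same symbolic form a tuned model would emit.
PLAN = ["(pick-up a)", "(stack a b)"]

if __name__ == "__main__":
    print(DOMAIN)
    print(PROBLEM)
    print("\n".join(PLAN))
```

The appeal of this format is that a generated plan can be checked mechanically against the domain's preconditions and effects, which is far harder to do for free-form chain-of-thought.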
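
And to illustrate the data-movement problem the analog in-memory attention work targets, here is a minimal NumPy sketch of standard autoregressive attention with a growing key-value cache. It is my own illustration of the bottleneck, not the paper's analog mechanism: note how the entire cache is re-read on every generated token.

```python
# Minimal sketch of why autoregressive attention is dominated by memory traffic:
# each new token must read the whole key/value cache to produce one output.
import numpy as np

d = 64                      # head dimension
rng = np.random.default_rng(0)

K_cache = np.empty((0, d))  # keys seen so far
V_cache = np.empty((0, d))  # values seen so far

def attend(q, K, V):
    """Single-head attention for one query against the full cache."""
    scores = K @ q / np.sqrt(d)          # touches every cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                         # touches every cached value

for step in range(256):                  # generate 256 tokens
    q = rng.standard_normal(d)           # query for the new token
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    K_cache = np.vstack([K_cache, k])    # cache grows by one row per token...
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)    # ...but is re-read in full every step

# The paper's idea is to keep this state resident in analog memory and update it
# incrementally, avoiding the repeated read-out of the whole cache sketched here.
```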

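
Finally, for the time series paper, here is a rough sketch of what "a few injected examples" looks like structurally: the model receives a handful of related (history, future) pairs alongside the target history. The FewShotForecaster class and its averaging logic are stand-ins of my own for illustration; Google's actual model and interface will differ.

```python
# Sketch of few-shot forecasting: support examples are supplied as context
# alongside the target history. The forecaster below is a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(1)

def make_series(n, phase):
    """A toy daily-seasonal series with noise."""
    t = np.arange(n)
    return np.sin(2 * np.pi * (t + phase) / 24) + 0.1 * rng.standard_normal(n)

def split(series, horizon=24):
    return series[:-horizon], series[-horizon:]

# A few in-context examples from related series: (history, observed future) pairs.
support = [split(make_series(96, p)) for p in (0, 6, 12)]

# The series we actually want to forecast: only its history is available.
target_history = make_series(72, phase=3)

class FewShotForecaster:
    """Hypothetical stand-in for a few-shot time series foundation model."""
    def forecast(self, history, examples, horizon):
        # Placeholder logic only: average the example futures and align to the
        # target's last observed level. A real model conditions on the examples
        # through its context rather than this kind of averaging.
        futures = np.stack([future[:horizon] for _, future in examples])
        mean_future = futures.mean(axis=0)
        return mean_future + (history[-1] - mean_future[0])

prediction = FewShotForecaster().forecast(target_history, support, horizon=24)
print(prediction.shape)  # (24,): a forecast conditioned on the few-shot examples
```
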
Wishing you a great weekend.