Last week, a colleague and I released an app: Genieous (App Store link). The concept is pretty simple: if you don’t know what to make with the food you have in the house, Genieous is a cheeky assistant that helps you create recipes. It’s a free app; you should download and try it.
While we hope that the genie can live off tips and be self-sustaining, I had another motivation: AI is going to change the way we process information. Those who wish to wield AI as a tool should seek to understand it, and the best way to understand anything is through real-world experience. Also, I frequently stare at the refrigerator wondering what to make for dinner, and I need inspiration.
I quickly realized that every benefit and challenge that exists for AI in the world at large also exists in this tiny little app, which makes it a low-stakes way to explore how to deal with them. I’ll discuss all of these in more detail in future articles. These are some of the challenges we needed to think through, even with this small-scale app:
Strategy
If you are thinking about incorporating AI into your product or business, one of the first questions you need to answer is: where will you get the most leverage out of using AI? With generative AI, especially large language models, you need to account for it giving wrong answers much of the time.
One way to think about this is in terms of convexity. The best framing I have found for this is from Nassim Nicholas Taleb who, in Understanding is a Poor Substitute for Convexity, defines it as “asymmetry between the gains (as they need to be large) and the errors (small or harmless)”.
Domains where the benefit of being right far outweighs the cost of being wrong are the most appropriate places to deploy generative AI. In the case of Genieous, cooking is naturally a trial-and-error process. If we can give someone lots of ideas that they can take or leave as they see fit, the benefit (a great meal you can enjoy again and again and impress people with) far outweighs the drawback (a bad meal you can throw away while you order takeout instead).
User Experience
How do you take something that is very complex and open-ended and put it into a user-friendly product? How do you make the experience seamless, so it doesn’t feel like you are interacting with a chatbot? I really wanted to solve the problem of “what to make with the food I have” with the least friction possible. I also wanted users to feel focused on the problem at hand, rather than on playing with a large language model.
Model Selection
There are thousands of different language models, ranging from ones small enough to run on-device to the top-of-the-line models from OpenAI, Google, and Anthropic. They all have different capabilities and pricing models. Which one is best for this application, assuming steadily growing usage? There are chat models and instruction-tuned models; is one better than the other?
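To keep that decision reversible, it helps to treat the model as a swappable component and to put rough numbers on the pricing question early. Here is a minimal sketch; the model names, prices, and usage figures are illustrative placeholders, not real quotes.

```python
# A rough cost comparison across candidate models. All names and prices
# below are illustrative placeholders -- check your provider's current rates.
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    usd_per_1m_input: float   # price per million input tokens
    usd_per_1m_output: float  # price per million output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.usd_per_1m_input +
                output_tokens * self.usd_per_1m_output) / 1_000_000

candidates = [
    ModelOption("small-on-device-model", 0.00, 0.00),  # free, but less capable
    ModelOption("mid-tier-hosted-model", 0.15, 0.60),
    ModelOption("frontier-model", 5.00, 15.00),
]

# Assume ~300 input and ~700 output tokens per recipe request,
# at 50,000 requests per month once usage grows.
for m in candidates:
    print(f"{m.name}: ~${50_000 * m.cost(300, 700):,.2f}/month")
```

Running the numbers like this makes the capability-versus-cost trade-off concrete before you commit to a provider.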
Prompt Engineering
Language models can be very sensitive to how they are asked a question, and prompting techniques vary wildly across models. How should we ask an AI to generate recipes from a set list of ingredients? We want to get good results, but we also want the response to be easy for the computer to interpret. And, since many vendors charge per input and output token, we want to be efficient in how we ask and what we ask for. Often, the ‘better’ a model gets, the ‘slower’ the generation can be.
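To make that concrete, here is a sketch of the shape such a prompt might take: terse (input tokens cost money) and pinned to a machine-parseable format. The wording and JSON schema are illustrative, not Genieous’s actual prompt.

```python
import json

def build_prompt(ingredients: list[str], max_recipes: int = 3) -> str:
    # Keep instructions short to save input tokens, and pin down the output
    # format so the app can parse the response without guesswork.
    return (
        "Suggest recipes using only these ingredients plus common pantry "
        f"staples: {', '.join(ingredients)}. "
        f"Return a JSON array of at most {max_recipes} objects with keys "
        '"name", "ingredients", and "steps". Output JSON only, no prose.'
    )

def parse_response(text: str) -> list[dict]:
    # Models sometimes wrap JSON in extra text; cut to the array itself.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in model output")
    return json.loads(text[start:end + 1])

print(build_prompt(["eggs", "spinach", "feta"]))
```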
Evaluation
I’ve used the word ‘better’ a lot without saying anything about what it means to be ‘better’. It is impossible to make informed decisions about which model to use and how to interact with it without some way of evaluating how good the answers you are getting are. How will you know if changes are better or worse? What does it mean for a recipe to be good? Is it more important for the model to generate interesting and creative recipes, even if it adds an extra ingredient or two? Should the model generate a different list of recipes every time it is given the same ingredients? If two different models generate the same recipe, which one did a better job?
In the very beginning, you’ll probably base a lot of this on ‘vibes’. But you’ll want to keep track of what models, prompts and settings result in the subjectively best output. As a product gets more mature, you’ll want to augment that with some sort of rubric for scoring model performance.
Evaluating model performance requires data, and you will want to start collecting as much as you can as early in the project as possible.
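A little infrastructure goes a long way here. Below is a sketch of logging each generation together with a simple rubric score; the criteria and weights are assumptions for illustration, not a validated metric.

```python
import json
import time

# Illustrative rubric: criteria and their maximum points.
RUBRIC = {"uses_only_given_ingredients": 2, "steps_are_coherent": 2, "creative": 1}

def log_generation(model: str, prompt: str, output: str,
                   scores: dict[str, int], path: str = "eval_log.jsonl") -> None:
    # Append one JSON record per generation so models, prompts, and settings
    # can be compared later against the same data.
    record = {
        "ts": time.time(),
        "model": model,
        "prompt": prompt,
        "output": output,
        "scores": scores,
        "total": sum(scores.values()),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Early on, the scores come from a human eyeballing the output ("vibes"),
# but the log format stays the same once scoring gets more rigorous.
log_generation("frontier-model", "Ingredients: eggs, spinach, feta",
               '[{"name": "Spinach Feta Omelette"}]',
               {"uses_only_given_ingredients": 2, "steps_are_coherent": 2, "creative": 1})
```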
Model Improvements
New models come out all the time, and new research keeps appearing on techniques to make generation better. Providing context, changing parameters like temperature, and fine-tuning are all techniques that can improve model performance.
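Most of these knobs are a one-line change at the API level. Here is a sketch using the OpenAI Python SDK (other providers expose similar parameters); the model name and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; pick via your own evaluation
    messages=[
        # "Providing context": a system message steers the generation.
        {"role": "system", "content": "You are a cheeky but practical recipe genie."},
        {"role": "user", "content": "Ingredients: eggs, spinach, feta. Suggest 3 recipes."},
    ],
    temperature=1.0,  # higher = more varied recipes; lower = more repeatable
)
print(response.choices[0].message.content)
```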
Alignment
Alignment refers to the process of steering an AI towards “a person’s or group’s intended goals, preferences or ethical principles.” [Wikipedia] Even a lowly recipe creation app touches on alignment. If a person has allergies, health concerns, or dietary restrictions, an aligned recipe generation AI will provide recipes that take their needs into account.
But what if someone intentionally provides a bunch of ingredients that are not edible? What is the right thing to do? Should the AI refuse to generate recipes? Should it provide a warning and do it anyway? Should it happily play along with the human making the request? These are issues anyone deploying AI needs to grapple with.
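Whatever you decide, it helps to make the policy explicit in code rather than leaving it entirely to the model. Here is a minimal sketch; the keyword lists and the “warn, don’t refuse” policy are assumptions for illustration.

```python
# Screen ingredients before they ever reach the model. These lists are
# illustrative stand-ins, not a real safety database.
INEDIBLE_HINTS = {"soap", "detergent", "bleach", "glue"}

def check_ingredients(ingredients: list[str], allergies: set[str]) -> list[str]:
    warnings = []
    for item in ingredients:
        lowered = item.lower()
        if lowered in INEDIBLE_HINTS:
            warnings.append(f"'{item}' does not appear to be edible.")
        if lowered in allergies:
            warnings.append(f"'{item}' conflicts with a listed allergy.")
    return warnings

print(check_ingredients(["eggs", "bleach"], allergies={"peanuts"}))
# ["'bleach' does not appear to be edible."]
```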
Hallucinations
Hallucinations refer to the tendency of large language models to generate text that is not grounded in reality. Everyone who has used ChatGPT has noticed that it sometimes just makes things up. In Why Reliable AI Requires a Paradigm Shift, Alejandro Morfis writes:
Rather than storing explicit factual claims, LLMs implicitly encode information as statistical correlations between words and phrases. This means the models do not have a clear, well-defined understanding of what is true or false. They can just generate plausibly sounding text.
This is a classic “it’s even worse than it appears” moment. Everything that large language models generate is a hallucination; luckily, the statistically probable answer is quite often a good one.
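For a recipe app, one cheap defense is a grounding check after generation: flag anything in the recipe that the user never provided. A sketch, with the notion of “pantry staples” as an assumption:

```python
# Illustrative set of staples the model may assume without being told.
PANTRY_STAPLES = {"salt", "pepper", "oil", "water", "butter"}

def hallucinated_ingredients(recipe_ingredients: list[str],
                             provided: set[str]) -> list[str]:
    # Anything not provided by the user and not a staple was invented.
    allowed = {i.lower() for i in provided} | PANTRY_STAPLES
    return [i for i in recipe_ingredients if i.lower() not in allowed]

print(hallucinated_ingredients(
    ["eggs", "spinach", "feta", "saffron"],
    provided={"eggs", "spinach", "feta"},
))  # ['saffron'] -- surface it to the user rather than hiding it
```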
Toxicity
Toxicity refers to the tendency of models to generate offensive content. Researchers struggle to refine this definition further, so it is often an “I know it when I see it” thing. A recent paper on the subject defines it as “stress by contradiction of accepted morality and norms of interaction with respect to the situational and verbal context of interaction.” A rich ethical debate can be had around this concept.
Even if there were objective criteria to define toxicity, it is impossible to eliminate. Again from Why Reliable AI Requires a Paradigm Shift:
Recent research suggests that if there is a sentence that can be generated at all, no matter how low its base probability, then there is a prompt that will generate it with almost 100% certainty
That means, if you allow free input, there is no way in practice to prevent a language model from producing toxic output.
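The practical fallback is to screen output rather than trying to make toxic generation impossible. Here is a sketch using OpenAI’s moderation endpoint; any provider’s safety API, or a local classifier, fills the same role.

```python
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    # Ask the moderation endpoint whether the text violates content policy.
    result = client.moderations.create(input=text)
    return result.results[0].flagged

recipe_text = "model output to screen goes here"
if is_flagged(recipe_text):
    # Policy decision: hide the output rather than show something offensive.
    recipe_text = "Sorry, the genie couldn't conjure anything up this time."
```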
In the case of Genieous, we can also take this more literally: if the ingredients, or the way they are handled, would be toxic, then that is a problem. Some ingredients are always toxic, and others are dangerous if not handled properly. (Chicken is one of the most underrated dangerous ingredients, judging by food poisoning incidents.)
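One way to handle the literal kind is to attach handling notes to known-risky ingredients. The table below is an illustrative fragment, not real food-safety guidance; that should come from an authoritative source.

```python
# Illustrative handling notes keyed by ingredient name.
HANDLING_NOTES = {
    "chicken": "Cook to an internal temperature of 165°F (74°C).",
    "kidney beans": "Raw kidney beans are toxic; boil thoroughly before eating.",
}

def safety_notes(ingredients: list[str]) -> list[str]:
    return [HANDLING_NOTES[i.lower()] for i in ingredients
            if i.lower() in HANDLING_NOTES]

print(safety_notes(["Chicken", "spinach"]))
# ['Cook to an internal temperature of 165°F (74°C).']
```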
Bias
On Measures of Biases and Harms in NLP defines bias as a “skew that produces a type of harm towards different social groups.” Much of the time, bias happens because the data a model is trained on is skewed towards particular groups, or is missing data representing other groups. That causes language models to encode biases in their weights. It is impossible to eliminate bias, but being able to measure it, understand it, and improve on it is a worthy pursuit.
The harm that biases in a lowly recipe app can do is limited, but we want to please every palate, and drawing from the entire world’s culinary wisdom will produce better outcomes for everyone.
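Measurement can start small. Here is a sketch of one crude metric: tracking which cuisines the app’s suggestions skew toward over time. The labels come from the model’s own output, which is itself imperfect, so treat this as a starting point rather than an audit.

```python
from collections import Counter

def cuisine_skew(generated_recipes: list[dict]) -> Counter:
    # Count how often each cuisine label shows up in past suggestions.
    return Counter(r.get("cuisine", "unlabeled") for r in generated_recipes)

history = [
    {"name": "Spinach Feta Omelette", "cuisine": "Greek"},
    {"name": "Shakshuka", "cuisine": "Middle Eastern"},
    {"name": "Frittata", "cuisine": "Italian"},
    {"name": "Egg Fried Rice", "cuisine": "Chinese"},
]
print(cuisine_skew(history).most_common())
```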
