About OpenNutrition

Josh Dickson, Founder

Background

About a year ago, following a trip to Guatemala, I decided it was time to move on from the startup I’d co-founded, where I’d led engineering for five years. I became interested in exploring how recent advances in generative AI could enable entirely new kinds of consumer products—ones whose core innovations leveraged AI but didn’t explicitly market themselves as “AI products.”

My initial interest was sparked by a late 2022 interview that Ben Thompson conducted with Daniel Gross and Nat Friedman about transformer-based LLMs and their potential to drive revolutionary products. Since then, conversational interfaces and direct interactions with language models have become ubiquitous. Yet, I see enormous untapped potential beyond chatbots—specifically, in creating compelling consumer products whose innovations quietly leverage generative AI to achieve outcomes that were previously impractical or impossible.

Given my background as an amateur powerlifter, strength athlete, and someone who has successfully maintained significant weight loss over the long term, the nutrition technology space was a natural area of focus. While MyFitnessPal dominates this market, its inconsistent and unreliable data frequently frustrates users, and its monetization strategies have grown increasingly aggressive and user-hostile. Premium competitors like MacroFactor demonstrate clear consumer appetite for better solutions but remain expensive and overly rigid, relying on exact database matches that, in my experience, don’t reflect real-world eating habits. AI-driven photo recognition apps promise effortless convenience but significantly exaggerate their accuracy, serving better as supportive tools than as comprehensive solutions.

Recognizing these shortcomings, I saw an opportunity to leverage generative AI to build a comprehensive, robust nutritional database—one that was accurate, extensive, and genuinely useful for everyday users. I also envisioned using AI to fundamentally improve user interaction with nutritional data: quickly surfacing relevant matches, accommodating realistic customizations, and gracefully resolving coverage gaps through on-the-fly AI-driven research. This ultimately led to the creation of OpenNutrition.

Current State of Nutritional Data

Nutritional data is inherently public information. In the United States and many other countries, laws mandate that packaged food products clearly display nutritional details on labels and packaging, including calories, macronutrients (protein, fats, carbohydrates), vitamins, and minerals. Major restaurant chains must disclose calorie counts on menus and official websites, reinforcing transparency and empowering consumers to make informed dietary decisions.

Despite these mandates, publicly available nutritional data remains fragmented, incomplete, and difficult to use effectively. For example, the USDA, long considered a primary source of nutritional data through FoodData Central, has narrowed its coverage in recent years. Its more recent “Foundation Foods” dataset covers fewer than 300 food items, far fewer than the earlier “Standard Reference” (SR) Legacy dataset, which is no longer actively updated despite wide industry usage. While FoodData Central does include an extensive dataset of branded grocery products updated twice yearly (a key data source for OpenNutrition), it does not cover restaurant-prepared foods, customizations, or evolving consumer products.

Commercial nutritional databases attempt to address these gaps by merging public data with proprietary estimation methods, especially for micronutrient data. However, their restrictive licensing, API usage limits, prohibitions on public-facing exposure, and high costs severely limit their practicality for innovative and open-source applications. These constraints restrict wider dissemination, limiting their value to consumers and developers alike.

User-generated databases like MyFitnessPal initially expanded rapidly by leveraging extensive public contributions to build comprehensive nutritional profiles. However, despite relying heavily on user-generated input, MyFitnessPal provides no public or transparent access to this data, offers minimal data-quality oversight, and explicitly prohibits external reuse or verification of its database. This proprietary stance significantly limits transparency, making it impossible for external researchers, developers, or consumers to reliably assess data quality or accuracy. In contrast, community-driven initiatives like Open Food Facts transparently publish crowdsourced nutritional data for branded products, though they deliberately avoid estimating incomplete nutritional details, particularly micronutrients, which results in notable coverage gaps.

Data Methodology

OpenNutrition addresses these challenges through a novel approach. We start with authoritative public data from sources like USDA, FRIDA, AUSNUT, and CNF, augmenting it with carefully vetted web-based sources, such as academic publications, while also applying a filtering process designed to avoid inadvertently incorporating proprietary or copyrighted data. This filtering combines automated similarity checks with selective manual reviews to flag potentially restricted materials. Large language models (LLMs) are then used to ingest, reconcile, and interpret varied and sometimes conflicting nutritional information, filling coverage gaps through internal reasoning.

Final nutritional data is generated by providing a reasoning model with a large corpus of grounding data. The LLM is tasked with creating complete nutritional values, explicitly explaining the rationale behind each value it generates. Outputs undergo rigorous validation steps, including cross-checking with advanced auditing models such as OpenAI’s o1-pro, which has proven especially proficient at performing high-quality random audits. In practice, o1-pro frequently provided clearer and more substantive insights than manual audits alone.

Transparency is foundational to OpenNutrition. While many everyday food entries clearly cite their public database origins, significant work remains in systematically documenting citations for data generated through AI inference or web extraction. Improving attribution quality remains a critical priority. We plan to refine automated processes that attach source references to each entry, while carefully avoiding disclosure of specific data points that may be proprietary. Recent advances, like Claude AI Citations, will significantly aid these transparency efforts.

Inherent limitations remain, especially regarding precise micronutrient details for branded and restaurant foods, which rely on estimates derived from ingredient profiles. However, these inaccuracies tend to be minor, and the varied nature of typical diets should minimize any practical impact. Additionally, the open nature of the dataset and integrated user feedback mechanisms continuously enhance accuracy and reliability.

As an active daily user myself, my experience consistently validates the dataset’s quality. Unexpected values tend to indicate my own genuine gaps in nutritional assumptions rather than dataset inaccuracies.

While extensive efforts have been made to ensure accuracy, OpenNutrition’s data explicitly does not replace professional medical, nutritional, or dietary advice. All information provided by OpenNutrition is for general informational purposes only. Always independently verify nutritional data when accuracy is critical, and consult a healthcare provider or registered dietitian for personalized dietary guidance.

Dataset Creation

The OpenNutrition dataset includes four primary categories:

Generic Everyday Foods represent unbranded grocery staples like fruits, vegetables, meats, grains, and legumes. These items tend to be best reported by major public databases like USDA SR/Foundation Foods, CNF, and FRIDA.
Prepared Foods consist of common home-prepared or restaurant-style dishes, typically underserved by public data sources due to ingredient variability and inconsistent reporting.
Branded Grocery Products are packaged goods identified by UPC barcodes, generally well-reported by the USDA, but with limited micronutrient data for non-mandated fields.
Branded Restaurant Products include foods from chain restaurants required (or choosing) to disclose nutritional details. This data requires manual extraction from official disclosures and in general contains minimal reporting of non-mandated vitamins and minerals.

You can search the dataset here and download it for offline use here.

Technical Approach

To ensure comprehensive coverage and accuracy, OpenNutrition employs a multi-model, AI-driven workflow.

For Generic and Prepared Foods, I used Claude Sonnet 3.5 to handle initial itemization, naming standardization, and serving-size alignment due to its superior rule-following, particularly at low and near-zero temperature settings. To expand dataset coverage, especially for diverse cuisines, I used persona-based prompts (primarily with OpenAI’s o1-pro) to surface realistic food items that potential users might search for.

To address coverage gaps, I used both manual prompts and automated workflows with o1-pro (leveraging its broad knowledge) to generate complementary, related foods. This approach gave the LLM creative freedom to identify additional items that deserved inclusion in the dataset. I validated the suggestions with a number of additional prompts and embedding-based similarity checks (using OpenAI’s text-embedding-3-large) to make sure that items added to the dataset were both unique and important enough to merit inclusion.
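
As a rough illustration of an embedding-based uniqueness check, here is a minimal sketch. It assumes each food name has already been turned into an embedding vector; the tiny three-dimensional vectors, the `is_near_duplicate` helper, and the 0.9 threshold are all hypothetical stand-ins, not values taken from the actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_near_duplicate(candidate, existing, threshold=0.9):
    """True if the candidate vector is too close to any existing item.

    The threshold is an illustrative assumption; a real pipeline would tune it.
    """
    return any(cosine_similarity(candidate, vec) >= threshold
               for vec in existing.values())

# Toy vectors standing in for real embeddings (which have far more dimensions).
existing = {
    "chicken breast, grilled": [0.9, 0.1, 0.0],
    "brown rice, cooked": [0.0, 0.2, 0.95],
}
candidates = {
    "grilled chicken breast": [0.88, 0.12, 0.02],
    "tofu, firm": [0.1, 0.9, 0.1],
}

# Keep only candidates that are not near-duplicates of existing items.
accepted = [name for name, vec in candidates.items()
            if not is_near_duplicate(vec, existing)]
print(accepted)
```

In this toy data, the reworded chicken entry is rejected as a near-duplicate while the tofu entry is kept.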

For Branded Grocery Products, the dataset primarily originated with USDA data, supplemented by naming and serving size information from Open Food Facts.

Branded Restaurant Products posed more significant challenges. Initially, around 50 targeted searches were conducted using OpenAI’s DeepResearch, sourcing official nutritional disclosures from restaurant PDFs and websites. Extracting structured data from these reports required OpenAI’s o1-pro model, which proved uniquely capable of transforming large amounts of unstructured restaurant data into standardized JSON formats. Despite its effectiveness, o1-pro frequently ran into operational issues like timeouts, refusals, and failures to follow directions, which, combined with response times that were often ten minutes or more, limited how much of this data was included in the first public data release.
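
To make "standardized JSON" concrete, the sketch below shows one possible target shape for a single restaurant item. It does not reproduce the LLM extraction step itself; the `FIELD_MAP` column names, the output keys, and the example restaurant are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical mapping from column headers in a disclosure to standardized keys.
FIELD_MAP = {
    "Calories": "calories_kcal",
    "Total Fat (g)": "fat_g",
    "Protein (g)": "protein_g",
    "Carbohydrates (g)": "carbs_g",
}

def normalize_row(restaurant, raw):
    """Convert one raw disclosure row into a standardized record.

    Fields missing from the source are kept as None rather than guessed.
    """
    record = {"restaurant": restaurant, "name": raw["Item"].strip()}
    for source_key, target_key in FIELD_MAP.items():
        value = raw.get(source_key)
        record[target_key] = float(value) if value not in (None, "") else None
    return record

raw = {"Item": " Cheeseburger ", "Calories": "300",
       "Total Fat (g)": "13", "Protein (g)": "15"}
record = normalize_row("Example Burger Co.", raw)
print(json.dumps(record, indent=2))
```

Leaving absent fields as `None` keeps a later estimation step free to fill them explicitly instead of silently treating them as zero.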

The final nutritional dataset entries were produced using reasoning models (primarily o3-mini-high and o1-preview). These models were provided with extensive grounding data from authoritative non-commercial databases, data obtained from the web, and, when available, nutritional details of similar foods. Each model generated detailed textual explanations of nutritional values, field-by-field, followed by a summarized JSON representation. This two-step process facilitated immediate auditing of model reasoning, enabling prompt refinements for clarity and accuracy.
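
One way to picture the two-step output is a response that contains field-by-field reasoning followed by a JSON summary. The sketch below assumes a hypothetical `FINAL_JSON:` marker separating the two parts (the real prompt format is not described here) and shows how such a response could be split so that values lacking an explanation are easy to spot.

```python
import json

def split_reasoning_and_json(response, marker="FINAL_JSON:"):
    """Split a model response into its reasoning text and parsed JSON summary.

    The marker is a hypothetical convention, not the format used by the author.
    """
    reasoning, sep, payload = response.rpartition(marker)
    if not sep:
        raise ValueError("response does not contain the JSON marker")
    return reasoning.strip(), json.loads(payload)

# Illustrative response text; the numbers are made up for the example.
response = (
    "calories_kcal: both grounding entries agree on roughly 165 kcal per serving.\n"
    "protein_g: lean poultry, about 31 g per serving.\n"
    'FINAL_JSON: {"calories_kcal": 165, "protein_g": 31, "fat_g": 3.6}'
)

reasoning, values = split_reasoning_and_json(response)

# Fields that appear in the JSON but were never discussed in the reasoning.
unexplained = [name for name in values if name not in reasoning]
print(unexplained)
```

Here the check surfaces `fat_g`, a value reported without any accompanying rationale, which is the kind of gap that would prompt a refinement of the instructions.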

Though the final dataset reports nutritional values in per-100g increments, models generally generated data based on common serving sizes to minimize potential errors from unit conversions.
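
The conversion from a serving-based value to a per-100g value is a single scaling step. A minimal sketch follows, with a hypothetical `to_per_100g` helper and made-up values for a 28 g serving.

```python
def to_per_100g(serving_values, serving_size_g):
    """Scale per-serving nutrient values to per-100g values."""
    if serving_size_g <= 0:
        raise ValueError("serving size must be positive")
    factor = 100.0 / serving_size_g
    return {name: round(amount * factor, 2)
            for name, amount in serving_values.items()}

# Illustrative values for one 28 g serving (not taken from the dataset).
serving = {"calories_kcal": 170.0, "protein_g": 6.0, "fat_g": 15.0}
per_100g = to_per_100g(serving, 28.0)
print(per_100g)
```

Doing this once, in code, after generation means the model only has to reason about the serving size people actually see on a label.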

After generating initial data, I performed multiple validation checks to evaluate overall accuracy, including randomly selecting entries for auditing with o1-pro. While minor inaccuracies—primarily in micronutrient estimates—occasionally surfaced, overall nutritional profiles consistently fell within reasonable and expected ranges. Additionally, o1-pro’s detailed uncertainty explanations often exceeded the clarity achievable through manual web-based verification.
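
As one example of what a range check on a randomly sampled entry might look like, the sketch below compares stated calories against an estimate from the general Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat). This is an assumed sanity check rather than the audit procedure described above; the 20% tolerance, the field names, and the sample entries are hypothetical, and the estimate ignores fiber and alcohol.

```python
import random

def atwater_kcal(protein_g, carbs_g, fat_g):
    """Estimate energy using the general Atwater factors (4/4/9 kcal per gram)."""
    return 4 * protein_g + 4 * carbs_g + 9 * fat_g

def flag_for_review(entry, tolerance=0.2):
    """True if stated calories differ from the macro-based estimate by more
    than the given relative tolerance (an illustrative threshold)."""
    expected = atwater_kcal(entry["protein_g"], entry["carbs_g"], entry["fat_g"])
    if expected == 0:
        return entry["calories_kcal"] != 0
    return abs(entry["calories_kcal"] - expected) / expected > tolerance

# Made-up entries for illustration only.
entries = [
    {"id": "a", "calories_kcal": 165, "protein_g": 31, "carbs_g": 0, "fat_g": 3.6},
    {"id": "b", "calories_kcal": 400, "protein_g": 2, "carbs_g": 10, "fat_g": 1},
    {"id": "c", "calories_kcal": 52, "protein_g": 0.3, "carbs_g": 14, "fat_g": 0.2},
]

# Draw a reproducible random sample to audit, then apply the range check.
rng = random.Random(42)
sample = rng.sample(entries, k=2)
flagged = [entry["id"] for entry in sample if flag_for_review(entry)]
print(flagged)
```

A check like this only catches gross inconsistencies in macronutrients; micronutrient estimates still need the kind of model-assisted or manual review described above.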

In total, creation of the dataset employed a variety of different models and products:

OpenAI: o1-pro, o1, o1-preview, o3-mini (high), 4o-mini, DeepResearch, text-embedding-3-large (embedding model)
Anthropic: Claude Sonnet 3.5
DeepSeek: R1
Google: Gemini Flash 2.0, Gemini Flash Thinking 2.0
Perplexity: Sonar

Standardized APIs and tools like Vercel’s AI SDK simplified experimentation across these models.

Credits provided by OpenAI/Microsoft, Anthropic, and Amazon supported initial research phases. However, infrastructure constraints—including Amazon Bedrock’s delayed general availability of prompt-caching for Claude models—limited efficient utilization of these credits. Recently, the emergence of more cost-efficient models, such as Google’s Gemini 2.0 series and DeepSeek, has significantly reduced model costs and complexity. This trend suggests further decreases in computational expenses and reliance on external credits in future dataset expansions.

Transparency and Data Provenance

Transparency is foundational to OpenNutrition. Our goal is to provide clear citations for all nutritional data in our dataset, including explicit labeling of how each entry was created—whether directly from authoritative sources, indirectly derived from multiple data points, or estimated by language models (LLMs) based on general world knowledge.
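
The three creation paths named above could be represented as an explicit label on each value. The following is a minimal sketch of such a record; the `Provenance` enum, the `NutrientValue` fields, and the `needs_citation` rule are hypothetical and do not describe the dataset's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Provenance(Enum):
    """How a value was produced (labels are illustrative)."""
    AUTHORITATIVE = "authoritative_source"
    DERIVED = "derived_from_multiple_sources"
    LLM_ESTIMATED = "llm_estimated"

@dataclass
class NutrientValue:
    name: str
    amount_per_100g: float
    unit: str
    provenance: Provenance
    citations: List[str] = field(default_factory=list)

def needs_citation(value):
    """Sourced or derived values should carry at least one citation;
    pure LLM estimates are labeled as such instead."""
    return value.provenance is not Provenance.LLM_ESTIMATED and not value.citations

# Made-up values for illustration only.
values = [
    NutrientValue("protein", 31.0, "g", Provenance.AUTHORITATIVE,
                  ["USDA FoodData Central"]),
    NutrientValue("vitamin_k", 0.3, "ug", Provenance.DERIVED),
    NutrientValue("choline", 85.0, "mg", Provenance.LLM_ESTIMATED),
]
missing = [v.name for v in values if needs_citation(v)]
print(missing)
```

A per-value label like this makes attribution gaps queryable: the derived vitamin K value above is flagged because it has no citation attached.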

For most everyday foods, our dataset already includes detailed citations linking nutritional profiles to public database entries such as USDA FoodData Central, CNF, and FRIDA. However, significant gaps in attribution remain, especially where data has been sourced through web extraction, estimation processes, or AI-driven inference. Improving citation and attribution quality remains an ongoing priority, though it’s constrained by complexity, inference costs, and the inherent challenge of obtaining reliable citations from LLMs without compromising data accuracy.

To mitigate concerns about data provenance, I have proactively steered data collection efforts away from commercial nutritional databases—including those publicly accessible online. While we employ a robust filtering process and thorough reviews to avoid inadvertently including proprietary data, the frequent, uncredited republication of commercial nutritional information online complicates absolute certainty.

OpenNutrition respects the copyrights of proprietary nutritional data providers. If you believe our dataset contains your proprietary data, please contact me directly at legal@opennutrition.app with details about the specific entries in question.

OpenNutrition as Its Own First, Best Customer

OpenNutrition follows a model inspired by Amazon’s practice of being its own “first, best customer”—actively relying on its own products to drive continuous improvement. This means OpenNutrition comprises two complementary products: OpenNutrition Foods, the open-source nutritional dataset, and OpenNutrition Mobile, a commercial macro-tracking application built directly upon this dataset. This arrangement provides practical validation, drives feedback loops, and establishes a sustainable financial foundation for ongoing data development.

The expertise developed through building OpenNutrition Foods—including methods for quickly generating accurate nutritional data for diverse and ad-hoc food items—directly informed one of the mobile app’s core features: DeepSearch. DeepSearch solves a critical real-world problem users face when existing nutritional databases fail to deliver exact matches. When users encounter foods without database entries, DeepSearch employs real-time web research combined with advanced AI reasoning to instantly create accurate, usable nutritional information. For example, a barcode scan on an item not yet included in the dataset results in a real-time research phase to understand public knowledge about the food item. Importantly, these newly created entries are then fed back into the publicly available OpenNutrition dataset, steadily improving its coverage and usefulness.
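
The fallback flow described above (look up first, research on a miss, then feed the result back) can be sketched in a few lines. This is an assumption-laden illustration, not DeepSearch itself: `lookup_with_fallback`, the in-memory `dataset` dictionary, and the stubbed `fake_research` function are all hypothetical.

```python
def lookup_with_fallback(barcode, dataset, research):
    """Return (entry, was_researched).

    On a miss, call the research function and store its result so that
    later lookups for the same barcode hit the dataset directly.
    """
    entry = dataset.get(barcode)
    if entry is not None:
        return entry, False
    entry = research(barcode)
    dataset[barcode] = entry
    return entry, True

def fake_research(barcode):
    """Stand-in for real-time web research plus model reasoning."""
    return {"barcode": barcode, "name": "Unknown snack bar", "calories_kcal": 190}

# Made-up dataset contents for illustration only.
dataset = {"0001": {"barcode": "0001", "name": "Oat milk", "calories_kcal": 120}}

hit, hit_researched = lookup_with_fallback("0001", dataset, fake_research)
miss, miss_researched = lookup_with_fallback("0002", dataset, fake_research)
again, again_researched = lookup_with_fallback("0002", dataset, fake_research)
print(hit_researched, miss_researched, again_researched)
```

The second lookup of the same barcode no longer triggers research, which mirrors how newly created entries improve coverage for everyone who searches afterward.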

OpenNutrition Mobile also includes straightforward reporting tools, allowing users to easily flag inaccuracies or request additions to the dataset. User-submitted data is automatically prioritized, verified, and integrated through AI-driven pipelines, enhancing the dataset’s accuracy and relevance over time.

In short, the commercial success of OpenNutrition Mobile directly finances—and continually enhances—the open-source dataset. This approach aligns commercial sustainability with OpenNutrition’s broader mission: providing a comprehensive, accurate, and genuinely user-focused nutritional resource accessible to researchers, developers, and everyday consumers alike.

Use Cases & Intended Audience

OpenNutrition primarily serves two audiences:

Consumers: Everyday users who rely on accurate nutritional data for health management. OpenNutrition provides accessible, real-world nutritional information, simplifying meal logging and dietary tracking, and addressing common frustrations with existing products.

Developers & Researchers: Professionals creating innovative health applications or conducting dietary research. OpenNutrition’s openly licensed dataset, robust AI-driven methodology, and detailed sourcing facilitate rigorous analysis, reduce development hurdles, and accelerate innovation.

The open-source licensing model, including its attribution requirements, explicitly encourages the creation of new products and services built on top of OpenNutrition. This structure not only supports a vibrant ecosystem of nutritional applications but also ensures that improvements and corrections from external use flow back into the open-source dataset, continuously enhancing data quality and comprehensiveness.

Roadmap & Future Directions

OpenNutrition remains committed to continuous improvement and frequent updates. Immediate next steps focus on addressing community-driven feedback and enhancing dataset comprehensiveness, particularly by integrating culturally diverse and internationally relevant food items. This includes proactively incorporating new foods discovered through user interactions with our DeepSearch feature, addressing real-world coverage gaps identified by our user base.

In the longer term, we aim to significantly enhance the dataset’s usability by improving citation accuracy, supporting multiple common serving sizes, and expanding micronutrient data coverage, particularly for branded and restaurant-prepared foods, where information is currently limited. Each of these strategic initiatives directly supports OpenNutrition’s core mission: providing reliable, transparent, and practically useful nutritional information that empowers both individuals and innovators.

Crucially, these future enhancements will be driven by ongoing community engagement. By maintaining a continuous feedback loop—including bug reports, pull requests, and AI-driven suggestions—we aim to integrate contributions into the dataset.

Ultimately, our goal is to create the most comprehensive, transparent, and practically useful nutritional resource available. We welcome collaboration, feedback, and contributions from researchers, developers, and everyday users alike. If you have insights, suggestions, or questions—or simply want to get involved—reach out anytime at hello@opennutrition.app.