Product Metadata at Scale

4,300 Products, 40 Fields Each

Python, Shopify Admin API, Claude AI, Pandas

A pipeline that enriched 4,300 health supplement products with complete metadata across roughly 40 fields: product copy, category classification, wellness goal mapping, dietary tags, ingredient highlights, and structured attributes.

The Problem

Hewyn had 4,300 products in Shopify with minimal metadata. Most products had a title, a price, and maybe a one-line description imported from the supplier. No wellness goal tags. No dietary classifications (vegan, vegetarian, gluten-free). No structured ingredient information. No proper product copy. No category taxonomy beyond what the supplier provided.

This meant the website couldn’t filter products by dietary preference. The recommendation quiz couldn’t map goals to products accurately. Email campaigns couldn’t personalize by product attributes. Search was basic keyword matching against sparse titles. Every customer-facing feature that depended on product data was limited by the quality of that data.

The Approach

I built a pipeline that processed all 4,300 products through multiple enrichment stages. For each product, the system:

  1. Classified the product into a consistent category taxonomy, normalizing the inconsistent supplier categories into a clean hierarchy.
  2. Assigned wellness goals from a predefined set (sleep, energy, gut health, stress, immunity, etc.) based on the product’s ingredients and existing descriptions.
  3. Tagged dietary attributes: vegan, vegetarian, gluten-free, dairy-free, soy-free, and others, cross-referenced against ingredient lists.
  4. Wrote product copy that followed a consistent brand voice and structure, replacing the supplier descriptions.
  5. Filled structured fields: serving size, form factor (capsule, powder, liquid), key ingredients, usage instructions, and warnings.
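A minimal sketch of what the deterministic part of step 1 might look like: a lookup that folds messy supplier category strings into a two-level hierarchy, with anything unmapped routed to review. The category names, the `SUPPLIER_TO_TAXONOMY` table, and `normalize_category` are all hypothetical; the real taxonomy and the split between rules and AI classification are not shown here.

```python
# Hypothetical mapping from raw supplier categories to (top level, subcategory).
SUPPLIER_TO_TAXONOMY = {
    "vits & minerals": ("Vitamins & Minerals", "Multivitamins"),
    "sleep aids": ("Sleep & Relaxation", "Sleep Support"),
    "probiotic": ("Digestive Health", "Probiotics"),
    "probiotics": ("Digestive Health", "Probiotics"),
}

# Fallback for categories the table does not know about (assumed convention).
UNMAPPED = ("Uncategorized", "Needs Review")


def normalize_category(raw: str) -> tuple[str, str]:
    """Map a supplier category string onto the clean hierarchy."""
    # Lowercase and collapse whitespace so "  Probiotics " and "probiotics" match.
    key = " ".join(raw.lower().split())
    return SUPPLIER_TO_TAXONOMY.get(key, UNMAPPED)
```

An explicit fallback keeps unknown supplier categories visible instead of silently dropping them into the nearest bucket.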

In total, the pipeline filled approximately 40 fields per product. The enrichment used Claude AI for classification and copy generation, with rule-based validation layers to catch obvious errors (for example, a product containing gelatin shouldn’t be tagged vegan).
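The gelatin example can be sketched as a small rule check. This is an illustration under assumptions, not the production validator: the `ANIMAL_DERIVED` list is deliberately incomplete, and `find_vegan_conflicts` is a hypothetical name.

```python
import re

# Illustrative, incomplete list of animal-derived ingredients (an assumption).
ANIMAL_DERIVED = {"gelatin", "collagen", "fish oil", "beeswax", "whey", "lanolin"}


def find_vegan_conflicts(ingredients: list[str], ai_tags: set[str]) -> list[str]:
    """Return animal-derived terms that contradict an AI-assigned 'vegan' tag."""
    if "vegan" not in ai_tags:
        return []  # nothing to contradict
    hits = set()
    for ingredient in ingredients:
        name = ingredient.lower()
        for term in ANIMAL_DERIVED:
            # Whole-word match so "whey" does not fire inside unrelated words.
            if re.search(rf"\b{re.escape(term)}\b", name):
                hits.add(term)
    return sorted(hits)
```

A non-empty result means the AI tag and the rule disagree, which is the signal for sending the product to manual review rather than writing the tag.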

The results were written back to Shopify through the Admin API in batches, respecting rate limits and logging every change for audit purposes.
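A rough sketch of batched writes with an audit trail. The actual Shopify call is not shown in this write-up, so `update_fn` is a hypothetical callable standing in for it, and the fixed pause between batches is a simplified stand-in for whatever throttling the real client used. Batch size and pause length are placeholder values.

```python
import time
from datetime import datetime, timezone


def chunked(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_in_batches(updates, current, update_fn, audit_log,
                     batch_size=50, pause_seconds=1.0, sleep_fn=time.sleep):
    """Apply field updates batch by batch and log before/after values.

    updates: list of (product_id, {field: new_value})
    current: {product_id: {field: existing_value}}
    update_fn: hypothetical callable that performs the real API write
    """
    for batch_index, batch in enumerate(chunked(updates, batch_size)):
        if batch_index > 0:
            sleep_fn(pause_seconds)  # crude pause between batches
        for product_id, fields in batch:
            before = current.get(product_id, {})
            # Only send fields whose value actually changes.
            changed = {k: v for k, v in fields.items() if before.get(k) != v}
            if not changed:
                continue
            update_fn(product_id, changed)
            for field, after in changed.items():
                audit_log.append({
                    "product_id": product_id,
                    "field": field,
                    "before": before.get(field),
                    "after": after,
                    "at": datetime.now(timezone.utc).isoformat(),
                })
```

Logging one row per field, rather than one per product, makes it straightforward to filter the log by field when an error turns out to affect a single attribute.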

Key Decisions

  • AI classification with rule-based guardrails. AI is good at reading an ingredient list and determining if a product is vegan. It’s not perfect. So every AI classification ran through validation rules: known animal-derived ingredients trigger a flag, known allergens get cross-checked, and any conflict between AI output and rule output gets queued for manual review.

  • Batch processing with full audit logging. When you’re modifying 4,300 products, you need to know exactly what changed and when. Every field update was logged with before and after values. This saved us twice when we caught classification errors in a specific product subcategory.

  • Consistent copy structure over creative variety. Every product description follows the same template: what it is, what it’s for, key ingredients, how to use it. This sounds boring, but consistency across 4,300 products matters more than any individual product having a clever description. Customers scanning a category page need predictable information architecture.
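The fixed copy structure from the last point can be sketched as a template that refuses to render when a section is missing. The `ProductCopy` class, its field names, and the section labels are assumptions for illustration; the real template and brand voice are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ProductCopy:
    # The four sections named in the template: what it is, what it's for,
    # key ingredients, how to use it. Field names are hypothetical.
    what_it_is: str
    what_its_for: str
    key_ingredients: list[str]
    how_to_use: str


def render_description(copy: ProductCopy) -> str:
    """Render the four sections in a fixed order, failing on empty ones."""
    missing = [name for name, value in vars(copy).items() if not value]
    if missing:
        raise ValueError(f"Missing sections: {', '.join(missing)}")
    return "\n\n".join([
        copy.what_it_is.strip(),
        copy.what_its_for.strip(),
        "Key ingredients: " + ", ".join(copy.key_ingredients),
        "How to use: " + copy.how_to_use.strip(),
    ])
```

Failing loudly on an empty section means a product never ships with a partial description, which is the point of choosing consistency over variety.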

What I Learned

The hardest part wasn’t the AI or the API. It was defining the taxonomy. How many wellness goal categories should exist? Is “joint health” a subcategory of “fitness” or its own category? Does a multivitamin get tagged with every goal it supports, or just the primary ones? These are editorial decisions disguised as data architecture decisions. They determine what customers can find and how they find it.

The other lesson was about scale and trust. When you enrich 4,300 products, you can’t manually verify every one. You have to trust your validation rules and spot-check intelligently. I learned to check the edges: the products where AI confidence was lowest, the categories with the fewest examples, the ingredients that appeared in only one or two products. Errors cluster at the margins, not in the middle.
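Checking the edges can be expressed as a small selection function. This sketch assumes each product record carries a numeric `confidence` score, which is not something the pipeline description specifies; the thresholds and the `pick_spot_checks` name are likewise hypothetical.

```python
from collections import Counter


def pick_spot_checks(products, confidence_floor=0.7, rare_threshold=2):
    """Return {product_id: [reasons]} for products worth a manual look.

    products: list of dicts with 'id', 'category', 'confidence', 'ingredients'.
    """
    category_counts = Counter(p["category"] for p in products)
    # Count each ingredient once per product.
    ingredient_counts = Counter(i for p in products for i in set(p["ingredients"]))

    flagged = {}
    for p in products:
        reasons = []
        if p["confidence"] < confidence_floor:
            reasons.append("low_confidence")
        if category_counts[p["category"]] <= rare_threshold:
            reasons.append("rare_category")
        if any(ingredient_counts[i] <= rare_threshold for i in p["ingredients"]):
            reasons.append("rare_ingredient")
        if reasons:
            flagged[p["id"]] = reasons
    return flagged
```

Recording the reason alongside each flagged product helps a reviewer know what to look at first.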


Built for Hewyn, a DTC wellness brand. Pipeline architecture shown. Product data anonymized.