Revisiting: what food and philosophy have in common on Wikipedia  

2026-05-27 · 3 min read

In August 2021 I wrote Network Analysis of Wikipedia: What Food and Philosophy have in Common - a small experiment that built a directed graph from the Wikipedia article "Food" by breadth-first crawling its outbound links to depth 2. About 40,000 nodes after dedup. The cleanest cluster that came out wasn't cuisines, as I'd expected - it was food ethics: Ethics of Eating Meat, Animal Welfare, Vegetarianism. Food, on Wikipedia, is closer to philosophy than to recipes.

Reading it now: the finding still holds. The method I would throw out and redo from scratch.

What I'd do differently

The 2021 script used the Python wikipedia package to fetch each page over HTTPS, parse it, and extract links. That's polite for ~10 pages and extremely impolite for ~40,000. Every node was an HTTP round trip plus an HTML parse. The crawl took hours and put load on Wikipedia's servers that they shouldn't have to absorb for a class project.

The right way, then and now, is the Wikipedia data dump. Wikimedia publishes the full link graph as downloadable SQL/XML at https://dumps.wikimedia.org/. You pull pagelinks.sql.gz once (plus page.sql.gz to map page IDs to titles, and in recent dumps linktarget.sql.gz to resolve link targets), load it into SQLite or Postgres after converting the MySQL-flavoured SQL, and now you have the entire Wikipedia link graph locally - no rate limits, no HTML parsing, no apologetic emails from someone's sysadmin. The same depth-2 crawl from "Food" becomes two SQL queries.
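As a rough sketch of what those two queries could look like, here is a version using Python's built-in sqlite3 module. It assumes the dump has already been flattened into a hypothetical table links(src, dst) keyed by article title, which is not the real pagelinks schema (the real tables use page IDs and need joins to recover titles). The sample rows are toy data, not Wikipedia's actual links.

```python
import sqlite3


def depth2_neighborhood(conn, root):
    """Return (depth1, depth2) sets of titles reachable from root."""
    # Query 1: everything the root links to directly.
    depth1 = {row[0] for row in conn.execute(
        "SELECT DISTINCT dst FROM links WHERE src = ?", (root,))}
    # Query 2: everything those pages link to, via a self-join.
    depth2 = {row[0] for row in conn.execute(
        """
        SELECT DISTINCT l2.dst
        FROM links AS l1
        JOIN links AS l2 ON l2.src = l1.dst
        WHERE l1.src = ?
        """, (root,))}
    # Keep only pages first reached at depth 2.
    depth2 -= depth1 | {root}
    depth1.discard(root)
    return depth1, depth2


# Toy data standing in for the real link table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (src TEXT NOT NULL, dst TEXT NOT NULL)")
conn.execute("CREATE INDEX links_src ON links (src)")
conn.executemany("INSERT INTO links VALUES (?, ?)", [
    ("Food", "Vegetarianism"),
    ("Food", "Cuisine"),
    ("Vegetarianism", "Animal welfare"),
    ("Vegetarianism", "Food"),
    ("Cuisine", "Vegetarianism"),
    ("Animal welfare", "Ethics"),
])

depth1, depth2 = depth2_neighborhood(conn, "Food")
print(sorted(depth1))
print(sorted(depth2))
```

The index on src matters more than the query shape: without it, the self-join scans the whole table once per depth-1 page.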

What I'd change about the analysis

The 2021 post used link presence as the only edge signal. Every outbound link counted equally. That overweights chrome, the generic links that show up everywhere (nearly every food article links to "United Kingdom" or "United States"; that doesn't mean food is about the UK).

Better signals available now:

  • Link position - links in the first paragraph or infobox are far more topical than links buried in a "See also" section.
  • Click data - Wikimedia publishes monthly clickstream data: counts of how often readers went from one article to another. That's a much truer "relatedness" signal than what an editor happened to link.
  • Embeddings - encode each article's lead section with a sentence transformer and use cosine similarity as a weighted edge. Catches semantic relatedness the link graph misses.
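The clickstream option is the easiest of the three to sketch. Wikimedia's clickstream files are tab-separated rows of referrer, target, link type and count; the snippet below turns rows of type "link" into weighted edges, then into each source's share of outbound clicks. The function names and sample rows are made up for illustration, and the counts are not real traffic numbers.

```python
import csv
import io
from collections import defaultdict


def clickstream_weights(tsv_file, keep_type="link"):
    """Sum click counts per (referrer, target) pair for one link type."""
    weights = defaultdict(int)
    # QUOTE_NONE: titles may contain quote characters that are not CSV quoting.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for prev, curr, link_type, n in reader:
        if link_type == keep_type:
            weights[(prev, curr)] += int(n)
    return dict(weights)


def outbound_shares(weights):
    """Convert raw counts into each source's share of its outbound clicks."""
    totals = defaultdict(int)
    for (src, _dst), n in weights.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in weights.items()}


# Invented rows in the clickstream column order: prev, curr, type, n.
sample = io.StringIO(
    "Food\tVegetarianism\tlink\t120\n"
    "Food\tUnited_Kingdom\tlink\t3\n"
    "other-search\tFood\texternal\t900\n"
)

weights = clickstream_weights(sample)
shares = outbound_shares(weights)
for edge, share in sorted(shares.items()):
    print(edge, round(share, 3))
```

Using shares rather than raw counts is one possible design choice: it keeps a link that exists on the page but is rarely clicked, like a country article, from carrying the same weight as a link readers actually follow.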

Clickstream weights would probably make the food/philosophy cluster sharper, not weaker. If people who land on "Food" really do click through to "Ethics of Eating Meat", that's a behavioral pattern the clickstream data can confirm, not just an editorial accident.

What still surprises me

I went in expecting the clusters to be cuisines. I came out with philosophy. Five years later, asking why food sits closer to ethics than to cooking on Wikipedia is still a more interesting question than the network analysis itself. Probably it says something about Wikipedia's editor base. Possibly it says something about food. The honest answer is I don't know, and the original post pretended I did.