Learning Et Al.
Learning Et Al. ("learning it all"). A daily research digest that finds, synthesizes, and contrasts academic papers and news articles based on your interests. Not an abstract delivery service. More like a curious friend explaining something over coffee.
I didn’t want to stray from the literature after leaving research and entering the workforce, but I didn’t want to read entire papers either. I wanted to see what’s out there and find new things to be genuinely curious about.
The Core Idea
The algorithm is backwards on purpose. Most recommendation systems find content first, then label it. This one generates a provocative central question before searching for papers: “Can AI agents be fashionable?” or “What if buildings could sense your mood?” Then it finds papers and news articles that serve as tools to think with in relation to that question, not papers that answer it directly, but papers that offer a surprising lens on it.
One Digest Per Day
You get one curated digest each morning. You can’t regenerate it. The constraint is the product: either you engage with today’s papers (dig deeper, ask questions, take notes) or you wait for tomorrow. This is anti-engagement-maximizing by design. The value is in curation, not volume.
Why Not Just Summarize Papers?
The Synthesis Pipeline
A single LLM call produces shallow summaries; I tried it. A 7-call pipeline still read like book reports. The current approach uses 15+ calls across 6 stages: it first extracts metadata, then builds a structural skeleton (which paper supports the argument, which complicates it, where the tension is) as JSON before any prose is written. Only then come prose generation, self-critique, and mandatory revision. The skeleton-first architecture is inspired by Yao 2023’s Tree of Thoughts and Madaan 2023’s Self-Refine.
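A minimal sketch of the skeleton-first idea: force the pipeline to commit to a machine-checkable argument structure before generating any prose. The field names here are illustrative, not the actual schema.

```python
# Hypothetical shape for the skeleton stage; the real schema is not shown
# in this post, so these field names are assumptions.
skeleton = {
    "central_question": "Can AI agents be fashionable?",
    "supports": ["paper_a"],      # which paper supports the argument
    "complicates": ["paper_b"],   # which paper complicates it
    "tension": "paper_a assumes X; paper_b measures the opposite",
}

def validate_skeleton(s):
    """Cheap structural check before any prose is written: the skeleton
    must name a tension and at least one paper on each side."""
    required = {"central_question", "supports", "complicates", "tension"}
    return required <= s.keys() and bool(s["supports"]) and bool(s["complicates"])
```

Because the skeleton is JSON, a failed check can simply re-prompt that one stage instead of regenerating the whole synthesis.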
Gap-Based Follow-Up Questions
The suggested questions come from a separate prompt that targets what the synthesis intentionally leaves out: “wait, but how?” moments tied to each paper’s most intriguing detail. Generic questions (“What are the implications?”) are banned. Answers are pre-generated at digest time with full paper context, so even logged-out visitors get instant, substantive follow-ups without any API call.
How Papers Are Found
Hybrid Ranking
Candidate papers are scored by both keyword matching (BM25) and semantic similarity (local embeddings), then combined via Reciprocal Rank Fusion, which sidesteps the problem of combining signals with incompatible scales. Venue and institution quality boosts push better sources up, and Maximal Marginal Relevance (λ=0.6) enforces diversity so you don’t get two papers from the same lab making the same point.
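The two fusion steps are small enough to sketch. This is a generic Reciprocal Rank Fusion and a generic Maximal Marginal Relevance, not the project's actual code; the `k=60` constant is the common default from the RRF literature, and the similarity inputs are assumed to be precomputed.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists whose raw scores
    live on incompatible scales, using only rank positions."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def mmr(candidates, sim_to_query, sim_between, lam=0.6, n=5):
    """Maximal Marginal Relevance: greedily pick relevant papers while
    penalizing similarity to papers already selected (diversity)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        best = max(
            pool,
            key=lambda c: lam * sim_to_query[c]
            - (1 - lam) * max((sim_between[(c, s)] for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

# Fuse a keyword (BM25) ordering with a semantic (embedding) ordering.
fused = rrf([["p3", "p1", "p2"], ["p2", "p3", "p4"]])
```

With λ=0.6 the relevance term still dominates, but two near-duplicate papers can't both make the cut: once the first is selected, the second's penalty outweighs its relevance.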
Local Embeddings
All semantic similarity runs locally via ONNX: zero API cost, zero external dependency. When the model can’t load on serverless cold starts, the system falls back to keyword overlap transparently, keeping the same API surface with a degradation flag for logging.
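The fallback pattern looks roughly like this. The `load_embedder` failure here is a stand-in for a real ONNX cold-start load error, and the Jaccard token overlap is one plausible choice of "keyword overlap"; both are assumptions, not the actual implementation.

```python
def load_embedder():
    """Stand-in for loading the local ONNX model; on serverless cold
    starts the model file may not be available."""
    raise OSError("model unavailable")  # simulated cold-start failure

def keyword_overlap(a, b):
    """Fallback similarity: Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def similarity(a, b):
    """Same API surface either way; 'degraded' flags the fallback
    path so it shows up in logs."""
    try:
        model = load_embedder()
        return {"score": model.similarity(a, b), "degraded": False}
    except OSError:
        return {"score": keyword_overlap(a, b), "degraded": True}
```

Callers never branch on which backend ran; they just get a score plus a flag they can ignore or log.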
Staying Interesting
Theme Novelty
Each generated question is compared against the last 5 digests’ themes via embedding similarity. If the cosine similarity exceeds 0.5, the system forces a novelty rewrite with explicit instructions to pick different interest combinations. Without this, LLMs converge to a predictable question template within weeks.
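The novelty gate reduces to a cosine-similarity check against recent theme embeddings. A minimal sketch, assuming the theme vectors are already computed:

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def needs_novelty_rewrite(candidate_vec, recent_theme_vecs, threshold=0.5):
    """Force a rewrite if the new central question is too close to any
    of the last few digests' themes."""
    return any(cosine(candidate_vec, v) > threshold for v in recent_theme_vecs)
```

When this returns true, the system re-prompts with explicit instructions to pick a different interest combination rather than lightly rewording the same question.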
Interest Decay
Interests decay daily (×0.95), recently-used topics get a frequency penalty, and selection is weighted random rather than top-N so even low-weight interests surface occasionally. Engagement signals are intentionally microscopic (+0.1 per star, +0.05 per question) after discovering that a single starred paper could pollute an entire feed.
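The decay and selection logic is simple enough to show directly; this is a sketch using the constants from the text (×0.95 decay, +0.1 per star, +0.05 per question), with weighted sampling in place of top-N:

```python
import random

DECAY = 0.95
STAR_BOOST = 0.1
QUESTION_BOOST = 0.05

def daily_update(weights, starred=(), questioned=()):
    """Decay every interest, then apply the deliberately tiny engagement
    boosts so one starred paper can't dominate the feed."""
    updated = {topic: w * DECAY for topic, w in weights.items()}
    for topic in starred:
        updated[topic] = updated.get(topic, 0.0) + STAR_BOOST
    for topic in questioned:
        updated[topic] = updated.get(topic, 0.0) + QUESTION_BOOST
    return updated

def pick_interests(weights, n=3, rng=random):
    """Weighted sampling without replacement: low-weight interests still
    surface occasionally, unlike a deterministic top-N cut."""
    pool, chosen = dict(weights), []
    while pool and len(chosen) < n:
        topics, ws = zip(*pool.items())
        topic = rng.choices(topics, weights=ws, k=1)[0]
        chosen.append(topic)
        del pool[topic]
    return chosen
```

The star boost (0.1) is deliberately the same order as two days of decay on a weight of 1.0, so a single click nudges the feed rather than steering it.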
Prompt Engineering by Antipattern
Instead of vague tone instructions, the synthesis prompts ban specific bad patterns by example, the worst being “The question of whether X isn’t just about Y, it’s about Z.” There is also a hard banned-words list (demonstrates, reveals, highlights, nuanced, multifaceted), built after observing every synthesis start to sound identical.
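A banlist like this is also mechanically checkable, which makes the critique stage concrete. This checker is a hypothetical sketch; the regex for the template is my own approximation of the banned pattern.

```python
import re

BANNED_WORDS = {"demonstrates", "reveals", "highlights", "nuanced", "multifaceted"}
# Assumed regex for the "isn't just about X, it's about Y" template.
BANNED_PATTERNS = [re.compile(r"isn't just about .+, it's about", re.IGNORECASE)]

def violations(text):
    """Return banned words and template patterns found in a draft, so the
    revision stage can be pointed at concrete failures instead of 'fix the tone'."""
    found = [w for w in BANNED_WORDS if re.search(rf"\b{w}\b", text, re.IGNORECASE)]
    found += [p.pattern for p in BANNED_PATTERNS if p.search(text)]
    return found
```

Feeding the exact violations back into the revision prompt is cheaper and more reliable than hoping the model internalizes a style guide.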
Things I Reworked
Iterations
- Anchor-paper → theme-first: The original approach derived themes from a “best paper.” Highly cited papers dominated and pulled in wrong-field methodology papers. Eliminated the anchor entirely.
- Paper selection: citation graph (cross-field contamination) → keyword matching (terrible precision) → embedding-only (missed specifics) → BM25+embedding RRF with MMR diversity.
- Synthesis: paper-by-paper paragraphs (book reports) → single LLM call (too shallow) → 7-call pipeline → current 6-stage skeleton-first approach.
- Theme revision: Tried letting the LLM decide whether to revise. It always said “no change needed.” Made revision mandatory. Output quality jumped.
- News sources: hardcoded RSS → DuckDuckGo scraping (broke on one CSS change) → Serper/DDG with User-Agent rotation and field-specific RSS fallback chain.
The Vault
Past digests live in a searchable archive where you can browse themes over time and compare any two papers side by side. Brutalist research archive aesthetic: hard borders, box shadows, crosshair cursor, accent colors only in tags.
