What we are seeing in the AI startup space is a perfect example of the “no moat” problem: if your core product is essentially just clever prompt engineering wrapped around someone else’s frontier model, it is trivially easy for a competitor to reverse-engineer your workflow and undercut your price. Over the last few months, this lack of a defensible moat has triggered a rapid race to the bottom in automated peer review, moving from expensive managed services to open-source “bring your own key” (BYOK) scripts.
Here I am going to look at three tools specifically designed to review academic papers: Refine, IsItCredible, and Coarse.
Overview of the Tools
Refine: Refine positions itself as a premium, rigorous option for institutions, boasting testimonials from Ivy League professors and a high price point of $49.99 per review. It uses what it calls “massive parallel compute” to make hundreds of LLM calls to stress-test every line of a document.
IsItCredible: Built on the open-source Reviewer 2 pipeline, IsItCredible offers a standardized, pay-per-use middle ground with core reports starting at $5. It employs a clever “adversarial” architecture where “Red Team” agents try to find flaws and a “Blue Team” verifies them to prevent hallucinations.
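The red-team/blue-team pattern can be sketched in a few lines. This is a hypothetical illustration of the general architecture, not IsItCredible's actual code; `call_llm` stands in for any chat-completion API, and all function names are mine:

```python
def red_team(call_llm, paper_text, n_agents=3):
    """Each red-team agent independently hunts for flaws (one per line)."""
    claims = []
    for i in range(n_agents):
        prompt = (f"You are adversarial reviewer #{i}. List concrete flaws, "
                  f"one per line, in this paper:\n{paper_text}")
        claims.extend(call_llm(prompt).splitlines())
    return list(dict.fromkeys(claims))  # dedupe while preserving order

def blue_team(call_llm, paper_text, claims):
    """A verifier keeps only flaws it can ground in the text,
    filtering out red-team hallucinations."""
    verified = []
    for claim in claims:
        prompt = (f"Does the paper below actually contain this flaw?\n"
                  f"Flaw: {claim}\nPaper: {paper_text}\nAnswer yes or no.")
        if call_llm(prompt).strip().lower() == "yes":
            verified.append(claim)
    return verified

def review(call_llm, paper_text):
    """Red team proposes flaws; blue team filters to the verifiable ones."""
    return blue_team(call_llm, paper_text, red_team(call_llm, paper_text))
```

The point of the split is that generation and verification are separate calls: a flaw only survives if a second pass can ground it in the text, which is the main defense against hallucinated critiques.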
Coarse: Coarse represents the logical endpoint of this race as an open-source “Bring Your Own Key” (BYOK) tool that lets you run complex multi-agent reviews locally or via OpenRouter. Because users pay API costs directly rather than a marked-up service fee, a comprehensive paper review is significantly cheaper.
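To see why BYOK compresses margins so hard, a back-of-the-envelope estimate helps. The token counts and per-million-token prices below are illustrative assumptions, not quoted rates for any particular model or tool:

```python
def review_cost(input_tokens, output_tokens,
                price_in_per_m=3.00, price_out_per_m=15.00):
    """Raw API cost in dollars at hypothetical per-million-token prices."""
    return (input_tokens * price_in_per_m +
            output_tokens * price_out_per_m) / 1_000_000

# A multi-agent review of a ~10k-token paper might burn ~200k input tokens
# (the paper gets re-read by many agents) and ~30k output tokens:
cost = review_cost(200_000, 30_000)  # ~$1.05
```

Even allowing for generous assumptions, a run that costs a dollar or two in raw API spend is hard to sell for $49.99 once an open-source pipeline reproduces the orchestration.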
The “LLM as a Judge” Problem
The hardest part of all this is evaluation. How do you know if the AI reviewer is actually good?
Refine relies almost entirely on anecdotal evidence. Their own FAQ essentially tells you to just try it and see the difference for yourself, claiming that general-purpose chatbots cannot match their depth even with expert prompting. This “try it yourself” approach is effective for marketing, but it isn’t a hard benchmark.
IsItCredible and Coarse are trying to be more systematic. The IsItCredible team released a paper, Yell at It: Prompt Engineering for Automated Peer Review, where they benchmarked their tool against five alternatives. They claim 15 wins out of 20 pairings. Similarly, Coarse claims to have been “blind-evaluated” against Refine and Reviewer 2, scoring higher on coverage and specificity.
However, we are still largely in the “LLM as a judge” era. These benchmarks often use another LLM to decide which review is better. It is circular logic. Until we have a “Ground Truth” dataset of known mathematical errors or logical fallacies in published papers, we are just measuring which AI writes the most convincing-sounding critique.
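A minimal sketch of how these pairwise "LLM as a judge" benchmarks typically work, including the standard position-swap to control for order bias (the function names are mine, and `judge` stands in for any chat model, which is exactly the circularity: the judge is the same kind of model that wrote the reviews):

```python
def pairwise_winner(judge, review_a, review_b):
    """Ask the judge twice with positions swapped; only count a win
    if the verdict is consistent, otherwise call it a tie."""
    v1 = judge(f"Which review is better?\nA: {review_a}\nB: {review_b}")
    v2 = judge(f"Which review is better?\nA: {review_b}\nB: {review_a}")
    if v1 == "A" and v2 == "B":
        return "first"
    if v1 == "B" and v2 == "A":
        return "second"
    return "tie"
```

Notice that nothing in this loop touches ground truth. If the judge model systematically prefers longer or more confident-sounding reviews, the benchmark rewards exactly that, which is why a dataset of known, verifiable errors would change the game.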
Because evaluation is so difficult, this software category risks becoming a classic market for lemons. It is incredibly difficult to identify substantive differences in quality between these tools without some external, hard benchmark. To truly evaluate whether Refine’s expensive managed service is meaningfully better than Coarse’s open-source BYOK run, you have to verify the AI’s claims. But verifying those claims requires spending just as much time reading and reviewing the original paper as you would have spent doing the review yourself from scratch. Without transparent benchmarks, users cannot easily distinguish rigorous, high-quality analysis from convincing hallucinations, driving the market toward the cheapest option by default.
For those building AI tools, this entire space serves as a warning about the race to the bottom. I have previously written about deep research tools as another example of this phenomenon. If your only value proposition is a well-orchestrated prompt chain, open-source alternatives will inevitably compress your margins to zero. Eventually, the native GUI interfaces of the frontier models themselves may just become good enough that your specialized service isn’t even needed.
Meta
Did you like this post? Guess what: it was generated entirely via Google’s API models (specifically the Gemini CLI). I have saved the chat session and a log of how long it took here. You can see for yourself: I had a broad idea, asked it to review different materials, and then had it generate a post. The whole iteration, from start to finish, took 25 minutes.
The original post is also not flagged by Pangram as AI-generated.
It definitely is not 100% my style (and to be clear, this meta section is 100% hand-written). I also struggled to get the model to say what I wanted in the final paragraph about deep research tools – I wanted it to say “deep research tools are another example where this same situation will occur”. I am keeping the original 100% AI-generated post for posterity, though, so folks can see what is possible with the current tools.
