Cloud Performance on a "Toy" Computer: From Python to Rust
It’s been a long, rewarding journey working on progscrape (nearly 15 years!), and I’m excited to share some milestones from this recent year, including an update on the major rewrite and the impressive performance benchmarks I'm seeing.
Rewriting for the Win
One of the most significant updates was the complete rewrite of progscrape in Rust. The new Rust stack replaced an ancient Python 2 AppEngine codebase that had been neglected for a while.
The original codebase had been running for more than a decade on AppEngine, and the usage fees had started to creep up to a few hundred dollars a year once Google discontinued the free tier. It was enough of an annoyance to see a monthly bill for a site that's purely a labour of love, and most of that being for storage of unaccessed, idle data.
AppEngine itself is more than 15 years old, and that SDK was really showing its age as well. Even modernizing this stack to a more recent Python version would be a pretty significant chunk of work.
This was a big undertaking, but the endeavour has paid off in terms of performance. The new Rust stack is based on Hyper + Axum, running on a Raspberry Pi 4 in my basement, and despite very modest hardware, it can keep up with a significant amount of search and browser traffic. I hit it with some load testing tools last year, and this setup easily handled up to 100 requests per second without breaking a sweat. It’s amazing to see how far a Raspberry Pi can scale!
Also, a huge shout-out to the authors of Tantivy, the fast, full-text search engine library written in Rust, which is the core of progscrape's search. Tantivy is incredibly fast, indexing the entire library of 1 million stories on a Raspberry Pi in seconds. For those looking for an alternative to Elasticsearch or Apache Solr, Tantivy is worth a look. It’s closer to Apache Lucene in design, being a crate to build a search engine rather than an off-the-shelf server.
Thanks to Tantivy, progscrape can easily service full-text search and a regular peak load of a few requests to a few dozen requests per second. Even under these conditions, the CPU usage barely spikes above a few percent. The library was pretty straightforward to integrate, has been incredibly useful with very few problems, and the team behind it was very responsive to bug reports.
If you want to see how responsive the search is on such a small device, try clicking the labels on each story—it's virtually instantaneous to query, and these queries can hit up to 10 years × 12 months of search shards! Check out a search for the 'javascript' tag.
Another major achievement was losslessly importing the nearly ten years' worth of stories archived on AppEngine into the new system. This means that progscrape is now indexing and serving a half-million stories on this tiny little ARM processor.
Story Pages
I'm excited about one major change on progscrape: story pages! Now, for each story in the database, you can access a dedicated page that aggregates all the relevant discussions and details about that story.
To view a story page, simply click the ...
icon next to the story's title in
the search results. This will open a details page for that specific story with
all of the links to discussions around the web, even if the story has been
submitted multiple times over the years.
Alternatively, you can navigate directly to a story page by using the following URL format:
https://progscrape.com/s/<url>
Here, <url>
represents the specific story's URL without the http:
or
https:
protocol prefix.
The search box also now includes the ability to search by URL. Enter a URL into the search box, and progscrape will load up the story page. This search will also provide links to discussions on platforms like Hacker News, Reddit, Lobsters, and Slashdot.
Search Trends
I've been thinking about what sort of data aggregation we can do with the data on progscrape. Since we have a long-term repository of programming stories, there's some analysis we can do with trends over time and how they are changing. I'd love to see more visualizations but the challenging part is making them useful!
The search pages for tags now include a graph of the story counts for that
particular tag over time. The UI for this is in development to make sure it's
useful. For now, it's a share of all the stories in a time period that share a
tag. Check out
the search page for the tiktok
tag as
an example.
I hope to add some additional filtering to search based on these graphs so that you can drill down by years or months to see what the top stories in a given topic were for that time. Let me know what you think!
Other Recent Changes
There are a few more improvements alongside these changes:
-
Improved Search Resilience: We've fixed issues that previously caused the web application to break during certain searches. Now, progscrape is more robust and will be better at interpreting search queries to return useful results.
-
Error Messaging for Empty Searches: Instead of showing a blank page, empty search result pages now display an error message. This helps users understand that no matching results were found, providing a clearer and more informative user experience.
-
Google Analytics Removed: progscrape is now using a small, lightweight, self-hosted analytics solution for better end-user privacy.
Android App
For those using Android, there’s still an open-source app available. You can download it from the Google Play Store here.
The app brings the convenience of progscrape to your mobile device, making it easier to stay updated on the latest stories and discussions.
Community
Although the major work on progscrape is now complete, I’m still tweaking the deployments here and there. The project is open-source, and I invite anyone interested in contributing to join in. The source code is available on GitHub.
Whether you’re curious about how any part of the stack works or interested in contributing to a Rust web project, your input and participation are welcome.
Wrapping up this major phase of progscrape feels like a significant milestone. The system is performant, the stories are plentiful, and the community has access to both web and mobile experiences. I’m thrilled with how the project has turned out and look forward to seeing how it evolves with community contributions.
Thank you for being part of this journey. Enjoy exploring progscrape and feel free to reach out to me at [email protected] if you have any questions or want to contribute!
Happy scraping!