Fatih Kalifa | Interface Engineer

When GitHub announced Actions a few months ago, I was really excited. One of the most interesting features was the ability to create a scheduled workflow using cron syntax. Of course, cron jobs are not a new technology, but being able to run cron without having to manage a server, while also integrating nicely with the GitHub API, is pretty amazing. One idea immediately came to mind:

I wanted to build a scraper.

I've been investing some of my money in a few crowdfunding platforms, one of which is Tanifund. Unfortunately, it has one glaring flaw: there's no proper notification system for new funding opportunities. This means I had to regularly check their website to see if there was a new funding project, and only then invest by transferring money from my bank account.

That's assuming I wasn't late to the party, which I usually was, especially when the project offered an 18%+ p.a. return.

So I did what I had to do. No, not contacting them to ask for this feature. I needed to build a scraper to notify myself. I wanted to be the first to know when they posted such opportunities.

Serverless Scraper

The first thing you need to do when building a scraper is figure out what kind of scraper the task requires. In my opinion, there are two kinds: HTTP-based and browser-based. An HTTP-based scraper gets its data from a single HTTP request, either by parsing HTML or by calling a somehow-public API endpoint. A browser-based scraper starts a new browser process that can execute JS, which you need when the page you're trying to scrape is rendered client-side.

Luckily for me, Tanifund has a public API endpoint that I can hit directly without any authentication, so I can simply use an HTTP-based scraper.
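
As a rough sketch of what the HTTP-based approach looks like in JS: one request, one JSON response, no browser. The URL and the `{ data: [...] }` response shape below are placeholders I made up for illustration, not Tanifund's actual API, and the global `fetch` assumes Node 18 or newer.

```javascript
// Hypothetical endpoint, for illustration only (not Tanifund's real API).
const PROJECTS_URL = "https://example.com/api/projects";

// Pull the project list out of the response body.
// Assumed shape: { data: [{ id, title, ... }] }
function extractProjects(body) {
  return body && Array.isArray(body.data) ? body.data : [];
}

// Fetch the project list with a single HTTP request.
// `fetchImpl` is injectable so the function can be tried without network access.
async function fetchProjects(url = PROJECTS_URL, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, {
    headers: { Accept: "application/json" },
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractProjects(await response.json());
}
```

A browser-based scraper would replace `fetchProjects` with something that drives a headless browser, which is a lot more moving parts for the same list of projects.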

Now that I knew what kind of scraper I needed to build, the next step was deciding how I wanted to be notified. There are a few channels I could use, but I settled on a Slack webhook because it's by far the easiest to get up and running.

To use a Slack webhook, I need to provide a webhook URL that my scraper calls whenever it finishes scraping. This URL is quite sensitive and could be abused if I simply hardcoded it in the repository. Thankfully, Actions integrates nicely with GitHub: I can create a new secret in the repository settings and read it inside Actions through an environment variable.

Problem solved.
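
Reading the secret and calling the webhook could look roughly like this. The variable name `SLACK_WEBHOOK_URL` is my own choice for the sketch, so use whatever name the secret is exposed under in your workflow. Slack incoming webhooks accept a JSON body sent with POST.

```javascript
// Read the webhook URL from the environment instead of hardcoding it.
// SLACK_WEBHOOK_URL is an assumed name for the repository secret.
function getWebhookUrl(env = process.env) {
  const url = env.SLACK_WEBHOOK_URL;
  if (!url) {
    throw new Error("SLACK_WEBHOOK_URL is not set");
  }
  return url;
}

// POST a JSON payload (for example { text: "..." }) to the webhook.
// `fetchImpl` is injectable; the default global fetch needs Node 18+.
async function notifySlack(payload, env = process.env, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(getWebhookUrl(env), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`Slack responded with status ${response.status}`);
  }
}
```

Failing loudly when the variable is missing makes a misconfigured secret show up as a failed run instead of a silently skipped notification.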

The next task was figuring out how to notify only on new funding opportunities. I needed to keep track of what had already been scraped and what was new, something like a database. Integrating a proper database seemed like overkill, so I came up with a good-enough database implementation: a text file.

I can store all the funding IDs in a CSV, and every time a scrape runs, I can compare the new IDs with the stored ones. All the IDs that aren't in the CSV yet go into a single Slack message, using the multiple attachments feature.
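
The comparison itself is a set difference. Here's a minimal sketch, assuming the CSV holds one ID per line and each scraped project has `id` and `title` fields (both are assumptions for the example). The `attachments` array with `title` and `text` follows Slack's message attachment format.

```javascript
// Parse the stored file into a Set of IDs (assumed format: one ID per line).
function parseStoredIds(csvText) {
  const lines = csvText.split(/\r?\n/).map((line) => line.trim());
  return new Set(lines.filter((line) => line !== ""));
}

// Keep only the projects whose ID has not been stored yet.
function findNewProjects(projects, storedIds) {
  return projects.filter((project) => !storedIds.has(String(project.id)));
}

// Build one Slack message with one attachment per new project.
function buildSlackMessage(newProjects) {
  return {
    text: `New funding opportunities: ${newProjects.length}`,
    attachments: newProjects.map((project) => ({
      title: project.title,
      text: `ID: ${project.id}`,
    })),
  };
}

// Produce the updated file content: old IDs followed by the new ones.
function serializeIds(storedIds, newProjects) {
  const all = [...storedIds, ...newProjects.map((project) => String(project.id))];
  return all.join("\n") + "\n";
}
```

If `findNewProjects` comes back empty, there's nothing to send and nothing to write, so the run can stop right there.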

Where do I store this CSV? In the same repository as my scraper, of course!

The other genius thing about Actions is that it automatically provides a GITHUB_TOKEN environment variable that has write access to the repository it runs in. I can create a commit from Actions without having to set up a personal access token or any other authentication mechanism. The best part? It has write access only to that specific repository and nothing more.
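
One way to turn that token into a commit is GitHub's REST endpoint for creating or updating file contents (`PUT /repos/{owner}/{repo}/contents/{path}`). The sketch below only builds the request. The helper name, the commit message, and the file path are made up for illustration, and the same write could just as well be done with plain `git commit` and `git push` inside the job.

```javascript
// Build the request for GitHub's "create or update file contents" endpoint.
// `repository` is "owner/name", the format Actions exposes as GITHUB_REPOSITORY.
// `sha` is the blob SHA of the existing file; the API requires it when updating.
function buildCommitRequest({ repository, path, content, sha, token }) {
  return {
    url: `https://api.github.com/repos/${repository}/contents/${path}`,
    options: {
      method: "PUT",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        message: "Update scraped funding IDs", // hypothetical commit message
        content: Buffer.from(content, "utf8").toString("base64"),
        sha, // dropped by JSON.stringify when undefined (new file)
      }),
    },
  };
}
```

Inside the workflow this would be called with `process.env.GITHUB_REPOSITORY` and `process.env.GITHUB_TOKEN`, and the resulting `url` and `options` passed straight to `fetch`.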

So there you have it: a scraper, a notification service, and a database, written purely in JS, without having to manage any server. It's a perfect system for this use case.

You can see it in action 😉 by visiting https://github.com/pveyes/scarecrow.

Serverless Scraper & Notification Service using GitHub Actions

Scheduled Workflow and Git Commit as Database Write
