Hi. I had a terrible idea recently: I tried to implement a web scraper so that I could extract a bunch of Substack posts and export them into documents. It didn’t exactly work how I had planned, but it was interesting enough for me to share.
Also, I don’t really know the legality of this walkthrough. It should be legal to scrape publicly available information, but I can’t make any guarantees. Assuming I have rights over what I’ve got posted on Substack, I promise to not sue any of you.
What
Web scraping is the act of extracting material from a website programmatically. Say, for example, you have a bunch of files sitting on a web page (kind of like my Tor site), and you want to get all of them. It would be really boring and time-consuming to click every single file and download it manually, so why not make your computer do it automagically? There are even tools that already exist to scrape the web for you, but we’re going to build our own in Python.
There is also a security side to web scraping. Perhaps there’s a list of phone numbers, names, addresses, whatever, posted on a website. If you want to extract all of that, you can just build a web scraper. And if you manage to authenticate yourself somewhere within a website that you wouldn’t normally be allowed to go, you can scrape all of the data in there.
Lastly, this is a bit like how the Wayback Machine/Internet Archive works. It scrapes a webpage’s contents at a point in time so that it can be referred to later on to discover changes in the webpage.
How
I hope you know that websites are written in HTML. If you want to scrape websites, you’ll need to understand a little bit of how it works. I’m not going over it. Google it. Sorry.
The HTML can be parsed with a Python library called BeautifulSoup, or bs4 for short. It’s what we’ll use today.
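Here’s a tiny taste of what that looks like; the HTML snippet below is made up purely for illustration:

from bs4 import BeautifulSoup

snippet = '<div class="post"><a href="https://example.com/p/hello">Hello</a></div>'
soup = BeautifulSoup(snippet, "html.parser")

# Find the first <a> tag and read its href attribute
print(soup.find("a")["href"])  # https://example.com/p/hello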
I’m going to walk you through a brief tutorial in which we will scrape my own Substack posts.
Methodology
Download the content of my Substack’s archive page.
Open the page content within BeautifulSoup.
Discover and extract all posts linked within the page.
Open the post pages and extract their content.
Requirements
Python with the BeautifulSoup and requests libraries installed.
pip install bs4 requests
Some knowledge of HTML.
First things first, we need to set up our imports.
import requests
import sys
from bs4 import BeautifulSoup
And to make things easier, we won’t hardcode any inputs and just ask for the URL from the command line.
if len(sys.argv) <= 1:
    print("Usage: python3 stackscrape.py <URL>")
    sys.exit(1)
Now, we need to get the URL from argv, acquire the webpage, and send it into BeautifulSoup.
# Get the URL from the command line
url = sys.argv[1]
# Get the HTML from the URL
html = requests.get(url).text
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
If you send in a bad link, it will error. Duh.
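If you’d rather catch that up front instead of letting it blow up somewhere further down, you could wrap the request like below. This is just a minimal sketch that isn’t in the original script, and the timeout value is arbitrary:

try:
    # Fail fast on connection problems and on 4xx/5xx responses
    response = requests.get(url, timeout=30)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Could not fetch {url}: {e}")
    sys.exit(1)

# Then feed the page into BeautifulSoup as before
html = response.text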
Now, we need to find all of the posts within the URL. I used my archive page, https://bowtiedcrawfish.substack.com/archive. From here, we need to find all of the posts that have their URLs listed. Open up that link and press Ctrl+U (or right click and click “View page source”). It’s a bit hectic to explore, but if you want to prettify it, you can have bs4 print it out pretty for you with print(soup.prettify()), or write it out to a file.
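If you go the file route, something like this does the trick (the filename here is arbitrary):

# Dump the prettified HTML so you can poke around in a proper text editor
with open("archive_source.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())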
To skip ahead of all the ugliness, we need to go here:
root
  entry
    main
      archive-page typography
        container
          portable-archive
            portable-archive-list
              post-preview portable-archive-post has-image
Once we reach “post-preview portable-archive-post has-image”, we iterate through all of those nodes and go here:
post-preview portable-archive-post has-image
  post-preview-content
    a
Once we reach “a”, we can simply grab the link from href.
posts = soup.find("body").find("div", {"id": "entry"})\
    .find("div", {"id": "main"})\
    .find("div", {"class": "archive-page typography"})\
    .find("div", {"class": "container"})\
    .find("div", {"class": "portable-archive"})\
    .find("div", {"class": "portable-archive-list"})\
    .find_all("div", {"class": "post-preview portable-archive-post has-image"})
# Use a list so that it's in order
links = []
for p in posts:
    # Get the div with class post-preview-content
    post_preview_content = p.find("div", {"class": "post-preview-content"})
    # Get the link from the <a> tag's href
    link = post_preview_content.find("a")["href"]
    links.append(link)
print(links)
I will run it at this point and get the below output.
>py .\stackscrape.py https://bowtiedcrawfish.substack.com/archive?sort=new
['https://bowtiedcrawfish.substack.com/p/video-games-external-cheating-basics', 'https://bowtiedcrawfish.substack.com/p/encoding-and-character-sets', 'https://bowtiedcrawfish.substack.com/p/shellcode-injection', 'https://bowtiedcrawfish.substack.com/p/dll-injection', 'https://bowtiedcrawfish.substack.com/p/dynamic-patching', 'https://bowtiedcrawfish.substack.com/p/password-attacks', 'https://bowtiedcrawfish.substack.com/p/shellter-and-meterpreter', 'https://bowtiedcrawfish.substack.com/p/road-to-malware-analyst-part-7-patching', 'https://bowtiedcrawfish.substack.com/p/for-college-students-ccdc', 'https://bowtiedcrawfish.substack.com/p/road-to-malware-analyst-part-6-anti', 'https://bowtiedcrawfish.substack.com/p/metasploit', 'https://bowtiedcrawfish.substack.com/p/road-to-malware-analyst-part-5-behavioral']
If you look closely, you’ll notice that it doesn’t grab all of my posts. This is the part I alluded to earlier where my idea didn’t work. Most likely the archive page only ships the first batch of posts in its initial HTML and loads the rest with JavaScript as you scroll, so a plain GET request never sees them. Still, it does grab 12 of them, which is pretty cool I guess.
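As an aside, that long chain of find() calls is pretty brittle: if Substack renames or rearranges any one of those divs, the whole thing breaks. The same lookup can also be done with a CSS selector. This is just an equivalent sketch, not a fix for the missing posts, and the selector may need tweaking if the markup differs from what I assumed:

# Same extraction as the chained find() calls, but located with a CSS selector
links = [p.find("a")["href"] for p in soup.select(".portable-archive-list .post-preview-content")]
print(links)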
We’re not done. Now we need to parse all of the links we just got, again. This shouldn’t be too hard, but there’s a slight issue: it appears that the post won’t load if you aren’t within the Substack website. Save the extracted HTML of a post page and open it locally, and the post simply never loads.
We gotta dance around this. The reason the post won’t load is Substack’s JavaScript. Unfortunately, removing that JavaScript will remove some of the fanciness it enables on the page, but otherwise we wouldn’t be able to read what we’ve scraped!
All that is left to do is to get the webpages from each link, extract the title of the post (so that we can name our local HTML files), erase Substack’s JavaScript that prevents the post from loading, then write the local copy onto our machine.
for link in links:
    # Get the HTML from the URL
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    # Get the page title
    title = soup.find("head").find("meta", {"property": "og:title"})["content"]
    # Get rid of "/" in the title so it can be used as a filename
    title = title.replace("/", "-")
    body = soup.find("body")
    # Remove the Substack scripts that stop the post from loading offline
    for s in body.find_all("script"):
        try:
            if "substackcdn" in s["src"]:
                s.decompose()
        except KeyError:
            # Inline scripts have no src attribute
            pass
    # Write the cleaned-up page to a local HTML file named after the post
    with open(title + ".html", "w", encoding="utf-8") as f:
        f.write(soup.prettify())
Note that the encoding has to be UTF-8, since my post about encoding contains characters that really make file writing freak out otherwise.
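If you’re wondering what “freak out” means in practice: on Windows, open() without an explicit encoding usually falls back to a legacy code page like cp1252, and writing a character outside that set raises an error. A contrived sketch, with the snowman standing in for any character the code page can’t represent:

text = "Here's a snowman: \u2603"

# With the default encoding (often cp1252 on Windows), this write typically
# raises UnicodeEncodeError because the code page has no snowman character.
with open("snowman.txt", "w") as f:
    f.write(text)

# Forcing UTF-8 handles it fine.
with open("snowman.txt", "w", encoding="utf-8") as f:
    f.write(text)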
After running the script, all of the scraped webpages are extracted locally.
If you open any of the files, you can read them as if you were on substack.com. This can come in handy if you need some reading material while you’re in the middle of nowhere with no service. It isn’t as cool as I imagined it would be, since it only extracts a handful of posts rather than all of them. It also doesn’t care whether you’re subscribed or not, although that may be possible to circumvent if you mess with your browser cookies.
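For the record, if you wanted the scraper to see what your own logged-in browser sees (for content you legitimately have access to), the usual approach is to hand requests your browser’s session cookie. A rough sketch; the cookie name below is a placeholder, so check your browser’s dev tools for the real name and value:

session = requests.Session()
# Placeholder cookie: copy the real session cookie name/value from your browser's dev tools
session.cookies.set("SESSION_COOKIE_NAME", "PASTE_VALUE_HERE", domain=".substack.com")
html = session.get(url).text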
Last point: this can be highly unethical if used improperly. This is for learning purposes only. Please read and support local Jungle Substacks without using tools to manipulate the visibility of their content. You are free to abuse my Substack with this tool if you wish, but please keep others in consideration.
That’s all for now. Have a great weekend.
Go!
-BowTiedCrawfish