You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade, as the latest version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!
Medium is an open platform where readers find dynamic thinking, and where expert and undiscovered voices can share their writing on any topic.
This tutorial will show you how to scrape articles from Medium.
The URL being used in this tutorial is: https://medium.com/search?q=covid
The main steps are outlined below. [Download task file here]
1. Create a Go to Web Page - to open the target website
2. Set up Pagination Loop - to scrape data from multiple listing pages
Input the XPath for pagination as: //button[contains(text(),'Show more')]
Click Apply
Set up page scrolling after new content is loaded
Click the Click to paginate step
Click Options
Tick Scroll down the page after it is loaded
Choose Scroll for one screen
Scroll 100 times
Click APPLY
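For reference, here is a rough Python/Selenium sketch of what this pagination setup amounts to. Octoparse performs the clicks and scrolls in its own built-in browser, so the driver setup, sleep times, and loop structure below are illustrative assumptions rather than what Octoparse runs internally.

```python
# Rough Selenium sketch of the pagination loop: scroll one screen at a time and
# click "Show more" whenever it appears. Timings and driver setup are assumptions.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://medium.com/search?q=covid")

for _ in range(100):  # roughly mirrors "Scroll for one screen" repeated 100 times
    driver.execute_script("window.scrollBy(0, window.innerHeight);")
    time.sleep(1)
    # Same pagination XPath as in the step above
    buttons = driver.find_elements(By.XPATH, "//button[contains(text(),'Show more')]")
    if buttons:
        buttons[0].click()
        time.sleep(2)  # give the newly loaded results time to render
```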
3. Create a loop click step - to click into articles
Click on one title
Click Select all similar elements on the Tips panel after the title turns green; Octoparse will then select all the titles
Modify Loop Item settings
Click Loop Item
Choose Variable List as Loop Mode
Input the Matching XPath as: //a[@rel="noopener follow"]/div/h2
Click Apply
Click on Click Item -> Options
Set the Load with AJAX timeout to 10s
Click Apply
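Continuing the Selenium sketch from step 2, the Variable List loop is conceptually equivalent to collecting every element that matches the title XPath and clicking them one at a time, with the 10-second wait playing the role of the AJAX timeout. The //article wait target is an assumption for illustration, not something Octoparse uses.

```python
# Conceptual equivalent of the Variable List loop over article titles.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

TITLE_XPATH = '//a[@rel="noopener follow"]/div/h2'

titles = driver.find_elements(By.XPATH, TITLE_XPATH)
for i in range(len(titles)):
    # Re-query on each pass because the DOM is rebuilt after navigating back
    title = driver.find_elements(By.XPATH, TITLE_XPATH)[i]
    title.click()
    # Wait up to 10 s for the article to load, like the "Load with AJAX" timeout
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//article"))
    )
    # ... extract the article fields here (steps 4-6), then go back (step 7) ...
```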
4. Extract data - to choose the target data
Click on the wanted data
Click Text on the Tips panel
Click Apply to save the settings
Double-click the header of a field to rename the fields
5. Modify XPath for the data fields - to locate elements accurately on every detail page
The XPath that Octoparse auto-generates for data fields may not work for all pages. We can rewrite the XPath for these elements to make sure they are detected on every page.
Change the Data Preview to a vertical view
Input the XPath for the data fields below:
author: //div[@aria-hidden="false"]/p/a
published_time: //div[@class="ab ae"]/span
title: //h1[contains(@class,'post-title')]
sub_title: //h2[contains(@class,'subtitle')]
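If you want to sanity-check these XPath expressions outside Octoparse, one quick way is to evaluate them against the rendered HTML of a single article page, for example with lxml on the page source from the Selenium sketch. Medium's markup changes often, so treat this only as a spot check and re-verify the class names when needed.

```python
# Spot-check the field XPaths against one rendered article page.
from lxml import html

tree = html.fromstring(driver.page_source)
field_xpaths = {
    "author": '//div[@aria-hidden="false"]/p/a',
    "published_time": '//div[@class="ab ae"]/span',
    "title": "//h1[contains(@class,'post-title')]",
    "sub_title": "//h2[contains(@class,'subtitle')]",
}
for name, xpath in field_xpaths.items():
    nodes = tree.xpath(xpath)
    value = nodes[0].text_content().strip() if nodes else None
    print(name, "->", value)
```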
6. Create a Loop Item - to extract article content
Click the "+" button under the Extract Data step and choose Loop
Click Loop Item > Switch Loop Mode to Variable List > Paste the XPath //article/div/div/section/div/div/div/div/p into the Matching XPath box > Click Apply
Click Confirm
Click "..." > Select Merge field data
7. Add a Back to the previous page step - to go back to the listing page
The Medium website loads article detail pages with AJAX, so the article page covers the previous listing page once we click open an article. In this case, we need to add a step to get back to the listing page.
Click "+" icon under Extract Data to add a step
Click Back to Previous Page
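In the Selenium sketch, this step is simply the browser's back action followed by a wait until the listing titles are visible again; Octoparse's Back to Previous Page step plays the same role inside its built-in browser.

```python
# Return to the listing page before the next pass of the title loop
# (continues the Selenium sketch; driver is the session opened in step 2).
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.back()
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//a[@rel="noopener follow"]/div/h2'))
)
```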
The final workflow will look like this:
8. Run the task - to get the target data
Click the Save button first to save all the settings you have made
Then click Run to run your task either locally or in the cloud
Select Run on your device and click Run Now to run the task on your local device
Wait for the task to complete
Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.
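Octoparse handles the export from its own interface. For comparison, if you were collecting the records yourself in the Selenium sketch, the equivalent export with pandas would look roughly like this (the records list is a placeholder for whatever you scraped):

```python
# Illustrative export of scraped records; the sample row is only a placeholder.
import pandas as pd

records = [
    {"author": "...", "published_time": "...", "title": "...",
     "sub_title": "...", "content": "..."},
]
df = pd.DataFrame(records)
df.to_csv("medium_covid_articles.csv", index=False)
df.to_json("medium_covid_articles.json", orient="records", force_ascii=False)
# df.to_excel("medium_covid_articles.xlsx", index=False)  # needs openpyxl installed
```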
Note: Medium requires a premium account to view more articles. You may need to log in to your account to get more data. Here is the related tutorial: Scrape data behind a login