You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade, as the latest version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!
Medium is an open platform where readers find dynamic thinking, and where expert and undiscovered voices can share their writing on any topic.
This tutorial will show you how to scrape articles from Medium.
The URL being used in this tutorial is: https://medium.com/search?q=covid
The main steps are outlined below. [Download task file here]
1. Create a Go to Web Page - to open the target website
2. Set up Pagination Loop - to scrape data from multiple listing pages
Input the XPath for pagination as: //button[contains(text(),'Show more')]
Click Apply
Set up page scrolling after new content is loaded
Click the Click to paginate step
Click Options
Tick Scroll down the page after it is loaded
Choose Scroll for one screen
Scroll 100 times
Click APPLY
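For reference, here is a rough Python/Selenium sketch of what this pagination setup amounts to. Octoparse performs the clicks and scrolls in its own built-in browser, so the driver setup, sleep times, and loop structure below are illustrative assumptions rather than what Octoparse runs internally.

```python
# Rough Selenium sketch of the pagination loop: scroll one screen at a time and
# click "Show more" whenever it appears. Timings and driver setup are assumptions.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://medium.com/search?q=covid")

for _ in range(100):  # roughly mirrors "Scroll for one screen" repeated 100 times
    driver.execute_script("window.scrollBy(0, window.innerHeight);")
    time.sleep(1)
    # Same pagination XPath as in the step above
    buttons = driver.find_elements(By.XPATH, "//button[contains(text(),'Show more')]")
    if buttons:
        buttons[0].click()
        time.sleep(2)  # give the newly loaded results time to render
```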
3. Create a loop click step - to click into articles
Click on one title
Click Select all similar elements on the Tips panel after the title turns green; Octoparse will then select all the titles
Modify Loop Item settings
Click Loop Item
Choose Variable List as Loop Mode
Input the Matching XPath as: //a[@rel="noopener follow"]/div/h2
Click Apply
Click on Click Item -> Options
Set the Load with AJAX timeout to 10s
Click Apply
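Continuing the Selenium sketch from step 2, the Variable List loop is conceptually equivalent to collecting every element that matches the title XPath and clicking them one at a time, with the 10-second wait playing the role of the AJAX timeout. The //article wait target is an assumption for illustration, not something Octoparse uses.

```python
# Conceptual equivalent of the Variable List loop over article titles.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

TITLE_XPATH = '//a[@rel="noopener follow"]/div/h2'

titles = driver.find_elements(By.XPATH, TITLE_XPATH)
for i in range(len(titles)):
    # Re-query on each pass because the DOM is rebuilt after navigating back
    title = driver.find_elements(By.XPATH, TITLE_XPATH)[i]
    title.click()
    # Wait up to 10 s for the article to load, like the "Load with AJAX" timeout
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, "//article"))
    )
    # ... extract the article fields here (steps 4-6), then go back (step 7) ...
```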
4. Extract data - to choose the target data
Click on the wanted data
Click Text on the Tips panel
Click Apply to save the settings
Double-click the header of a field to rename the fields
5. Modify XPath for the data fields - to locate elements accurately on every detail page
The XPath that Octoparse auto-generates for data fields may not work for all pages. We can rewrite the XPath for these elements to make sure they are detected on every page.
Change the Data Preview to a vertical view
Input the XPath for the data fields below:
author: //div[@aria-hidden="false"]/p/a
published_time: //div[@class="ab ae"]/span
title: //h1[contains(@class,'post-title')]
sub_title: //h2[contains(@class,'subtitle')]
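If you want to sanity-check these XPath expressions outside Octoparse, one quick way is to evaluate them against the rendered HTML of a single article page, for example with lxml on the page source from the Selenium sketch. Medium's markup changes often, so treat this only as a spot check and re-verify the class names when needed.

```python
# Spot-check the field XPaths against one rendered article page.
from lxml import html

tree = html.fromstring(driver.page_source)
field_xpaths = {
    "author": '//div[@aria-hidden="false"]/p/a',
    "published_time": '//div[@class="ab ae"]/span',
    "title": "//h1[contains(@class,'post-title')]",
    "sub_title": "//h2[contains(@class,'subtitle')]",
}
for name, xpath in field_xpaths.items():
    nodes = tree.xpath(xpath)
    value = nodes[0].text_content().strip() if nodes else None
    print(name, "->", value)
```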
6. Create a Loop Item - to extract article content
Click the "+" button under the Extract Data step and choose Loop
Click Loop Item > Switch Loop Mode to Variable List > Paste the XPath //article/div/div/section/div/div/div/div/p into the Matching XPath box > Click Apply
Click Confirm
Click "..." > Select Merge field data
7. Add a Back to the previous page step - to go back to the listing page
The Medium website loads article detail pages with AJAX, so the article page covers the previous listing page once we click open an article. In this case, we need to add a step to get back to the listing page.
Click "+" icon under Extract Data to add a step
Click Back to Previous Page
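In the Selenium sketch, this step is simply the browser's back action followed by a wait until the listing titles are visible again; Octoparse's Back to Previous Page step plays the same role inside its built-in browser.

```python
# Return to the listing page before the next pass of the title loop
# (continues the Selenium sketch; driver is the session opened in step 2).
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.back()
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//a[@rel="noopener follow"]/div/h2'))
)
```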
The final workflow will look like this:
8. Run the task - to get the target data
Click the Save button first to save all the settings you have made
Then click Run to run your task either locally or in the cloud
Select Run on your device and click Run Now to run the task on your local device
Wait for the task to complete
Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.
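Octoparse handles the export from its own interface. For comparison, if you were collecting the records yourself in the Selenium sketch, the equivalent export with pandas would look roughly like this (the records list is a placeholder for whatever you scraped):

```python
# Illustrative export of scraped records; the sample row is only a placeholder.
import pandas as pd

records = [
    {"author": "...", "published_time": "...", "title": "...",
     "sub_title": "...", "content": "..."},
]
df = pd.DataFrame(records)
df.to_csv("medium_covid_articles.csv", index=False)
df.to_json("medium_covid_articles.json", orient="records", force_ascii=False)
# df.to_excel("medium_covid_articles.xlsx", index=False)  # needs openpyxl installed
```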
Note: Medium requires a premium account to view more articles. You may need to log in to your account to get more data. Here is the related tutorial: Scrape data behind a login