Scrape articles from Medium
Updated over a week ago

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade, as the latest version is faster, easier, and more robust. Download and upgrade here if you haven't already done so!

Medium is an open platform where readers find dynamic thinking, and where expert and undiscovered voices can share their writing on any topic.

This tutorial will show you how to scrape articles from Medium.

1.png

The URL being used in this tutorial is: https://medium.com/search?q=covid

The main steps are shown in the menu on the right. [Download task file here]


1. Create a Go to Web Page step - to open the target website

  • Enter the target URL into the search bar on the home screen and click Start


2. Set up Pagination Loop - to scrape data from multiple listing pages

  • Click on Show more button

  • Click Loop click on the Tips panel

  • Input the XPath for pagination as: //button[contains(text(),'Show more')]

  • Click Apply

xpath.png
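The pagination XPath above matches any button element whose text contains "Show more". As a rough offline sanity check, the same predicate can be reproduced with Python's standard-library HTML parser (the sample markup below is illustrative, not Medium's actual page source):

```python
from html.parser import HTMLParser

class ShowMoreFinder(HTMLParser):
    """Stdlib equivalent of //button[contains(text(),'Show more')]:
    collect the text of each <button> and keep those containing 'Show more'."""
    def __init__(self):
        super().__init__()
        self._in_button = False
        self._buf = []
        self.matches = []
    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button, self._buf = True, []
    def handle_data(self, data):
        if self._in_button:
            self._buf.append(data)
    def handle_endtag(self, tag):
        if tag == "button" and self._in_button:
            text = "".join(self._buf)
            if "Show more" in text:
                self.matches.append(text)
            self._in_button = False

sample = "<div><button>Show more</button><button>Follow</button></div>"
f = ShowMoreFinder()
f.feed(sample)
print(f.matches)  # ['Show more']
```

If the XPath matches nothing when you test it in Octoparse, check the exact button text on the live page; Medium's UI wording can change.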

Set up the page scroll after new content is loaded

  • Click the Click to Paginate step

  • Click Options

  • Tick Scroll down the page after it is loaded

  • Choose Scroll for one screen

  • Scroll 100 times

  • Click Apply

SCROLL.png
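The scroll setting above implements a common infinite-scroll pattern: scroll one screen, pause so lazy-loaded content can render, and repeat a fixed number of times. A minimal sketch of that loop, where `scroll_once` is a hypothetical stand-in for whatever drives the browser:

```python
import time

def scroll_until_loaded(scroll_once, repeats=100, wait_s=1.0):
    """Infinite-scroll pattern: scroll one screen, wait for lazy-loaded
    content, and repeat a fixed number of times (100, as configured above)."""
    for _ in range(repeats):
        scroll_once()       # e.g. scroll the viewport by one screen height
        time.sleep(wait_s)  # give AJAX-loaded content time to render

# Usage with a stub callback that just counts invocations:
calls = []
scroll_until_loaded(lambda: calls.append(1), repeats=5, wait_s=0)
print(len(calls))  # 5
```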

3. Create a loop click step - to click into articles

  • Click on one title

  • Click Select all similar elements on the Tips panel after the title turns green; Octoparse will then select all titles

  • Click Loop click on the Tips panel

Modify Loop Item settings

  • Click Loop Item

  • Choose Variable List as Loop Mode

  • Input the Matching XPath as: //a[@rel="noopener follow"]/div/h2

  • Click Apply

  • Click Click Item > Options

  • Set the Load with AJAX timeout to 10s

  • Click Apply
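The 10 s AJAX timeout above is an instance of a generic poll-until-ready pattern: keep checking whether the page has finished loading, and give up once the timeout elapses. A minimal sketch, where `condition` is a hypothetical readiness callback rather than any Octoparse API:

```python
import time

def wait_for(condition, timeout_s=10.0, poll_s=0.2):
    """Poll condition() until it returns True or timeout_s elapses,
    mirroring an AJAX-load timeout of 10 seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_s)
    return False

print(wait_for(lambda: True))                                # True
print(wait_for(lambda: False, timeout_s=0.05, poll_s=0.01))  # False
```

A longer timeout makes the task more tolerant of slow pages at the cost of slower failure when an article never loads.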


4. Extract data - to choose the target data

  • Click on the wanted data

choose.png
  • Click Text on the Tips panel

  • Untick Extract data in the loop for the Extract Data step

    un.png
  • Set the wait time to 10s

  • Click Apply to save the settings

  • Double-click the header of a field to rename the fields


5. Modify the XPath for the data fields - to locate elements accurately on every detail page

The XPath Octoparse auto-generates for data fields may not work on every page. We can rewrite the XPath for these elements to make sure they are detected on every page.

  • Change the Data Preview to a vertical view

  • Input the XPath for each data field below:

    • author: //div[@aria-hidden="false"]/p/a

    • published_time: //div[@class="ab ae"]/span

    • title: //h1[contains(@class,'post-title')]

    • sub_title: //h2[contains(@class,'subtitle')]

data.png
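The attribute-based XPaths above can be exercised offline with Python's standard-library ElementTree, which supports simple attribute predicates (the snippet below is illustrative markup, not Medium's real HTML, and the contains() predicates used for title and sub_title are beyond ElementTree's XPath subset):

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed snippet standing in for a Medium article page.
sample = """
<html><body>
  <div aria-hidden="false"><p><a>Jane Doe</a></p></div>
  <div class="ab ae"><span>Mar 3, 2023</span></div>
</body></html>
"""
root = ET.fromstring(sample)

# ElementTree understands attribute predicates like [@aria-hidden="false"]:
author = root.find('.//div[@aria-hidden="false"]/p/a').text
published_time = root.find('.//div[@class="ab ae"]/span').text
print(author, published_time)  # Jane Doe Mar 3, 2023
```

Class-based XPaths like //div[@class="ab ae"] are brittle on Medium, whose class names are auto-generated and change over time; expect to re-check them if the task stops returning data.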

6. Create a Loop Item - to extract article content

  • Click the "+" button under the Extract Data step and choose Loop

  • Click Loop Item > switch the Loop Mode to Variable List > paste the XPath //article/div/div/section/div/div/div/div/p into the Matching XPath box > click Apply

  • Click the "+" icon and choose Extract Data inside Loop Item1

  • Click Add Custom Field on the Data Preview panel and select Capture data on the page

  • Click Confirm

  • Click "..." > Select Merge field data
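Conceptually, the loop above captures the article body one paragraph at a time, and Merge field data then joins those captures into a single field. A toy illustration of that merge step (the paragraph texts are made up):

```python
# Each loop iteration captures one <p> of the article body;
# merging concatenates the captures into one field, roughly like this:
paragraphs = [
    "First paragraph of the article.",
    "Second paragraph of the article.",
]
article_text = "\n".join(paragraphs)
print(article_text)
```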


7. Add a Back to Previous Page step - to go back to the listing page

The Medium website loads article detail pages with AJAX, so the article page covers the previous listing page once we click open an article. In this case, we need to add a step to get back to the listing page.

  • Click "+" icon under Extract Data to add a step

  • Click Back to Previous Page

back.png

The final workflow will look like this:


8. Run the task - to get the target data

  • Click the Save button first to save all the settings you have made

  • Then click Run to run your task either locally or in the cloud

  • Select Run on your device and click Run Now to run the task on your local device

  • Wait for the task to complete

mceclip9.png

Below is sample data from a local run. Excel, CSV, HTML, and JSON formats are available for export.

RESULT.png

Note: Medium requires a premium account to view more articles. You may need to log in to your account to get more data. Here is the related tutorial: Scrape data behind a login
