Scrape articles from The Washington Post
Updated over a week ago

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade because it is faster, easier and more robust! Download and upgrade here if you haven't already done so!

As one of the most popular news websites in the U.S., The Washington Post covers news from America and around the world. The site carries news on almost every topic, including politics, opinion, coronavirus, and sports.

In this tutorial, we will use Octoparse to scrape the Covid-related news posted on The Washington Post in bulk, extracting data such as the title URL, news title, and published date.

info.jpg

Target URL used below:

The main steps are shown in the menu on the right. [Download task file here]


1. Create a Go to Web Page - to open the target page

To start scraping, enter the target URL first.

  • Input the web page URL in the search box at the center of the home screen

  • Click Start to create a new task with Advanced Mode


2. Auto-detect web page data - to create a workflow

Octoparse's auto-detect function can identify the page structure and help to create a workflow quickly.

  • Click on Auto-detect web page data to start the detection automatically and wait for it to complete

detec.jpg
  • Once the auto-detection is done, click Create workflow to generate a workflow

create_workflow.jpg

The automatically generated workflow for this task looks like this:

workflow.jpg
  • Check the data fields in the Data Preview and delete unwanted fields or rename them if needed

    • Delete unwanted data fields directly by clicking More and Delete field

    • Modify the data field names by double-clicking the headers


3. Set up Pagination Loop - to scrape more results from multiple pages

To scrape more results, set up pagination that keeps loading additional results onto the page.

  • Click the Load more results button at the bottom of the page first

  • Click Loop click on the Tips panel to generate pagination
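For readers who want to see the idea behind a loop click in code, here is a minimal Python sketch. It is not how Octoparse implements the feature; `FakeListingPage`, `loop_click`, and `max_clicks` are hypothetical names that simulate a page revealing one more batch of articles per click. The loop stops when the button disappears or a click limit is reached.

```python
from dataclasses import dataclass, field


@dataclass
class FakeListingPage:
    """Hypothetical stand-in for a page with a 'Load more results' button."""
    batches: list                      # items revealed by each successive load
    visible: list = field(default_factory=list)
    clicks: int = 0

    def __post_init__(self):
        # The first batch is visible as soon as the page opens.
        if self.batches:
            self.visible.extend(self.batches[0])

    def has_load_more(self) -> bool:
        # The button exists only while unrevealed batches remain.
        return self.clicks + 1 < len(self.batches)

    def click_load_more(self) -> None:
        # Each click reveals the next batch of items.
        self.clicks += 1
        self.visible.extend(self.batches[self.clicks])


def loop_click(page, max_clicks=50):
    """Click 'Load more' until it disappears or max_clicks is reached."""
    while page.has_load_more() and page.clicks < max_clicks:
        page.click_load_more()
    return list(page.visible)


page = FakeListingPage([["article 1", "article 2"], ["article 3"], ["article 4"]])
print(loop_click(page))
```

A click limit such as `max_clicks` is a common safeguard in this kind of loop, since a listing may keep offering more results for a very long time.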


4. Run the task - to get the wanted data

The final workflow looks like this:

final_workflow.jpg
  • Click the Save button first to save all the settings you have made

  • Click Run next to it and wait for a Run Task window to pop up

  • Select Standard mode under the Run on your device section to run the task, then wait for it to complete

Here is the sample output data, which can be exported in Excel, CSV, HTML and JSON formats.

mceclip0.png
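Once the data is exported, it can be post-processed outside Octoparse. The sketch below reads CSV text with Python's standard `csv` module and drops rows that repeat the same article URL. The column names (`Title_URL`, `Title`, `Published_Date`) and the sample rows are assumptions for illustration; use the field names from your own Data Preview.

```python
import csv
import io


def load_articles(csv_text, url_field="Title_URL"):
    """Parse exported CSV text and keep the first row for each article URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    articles = []
    for row in reader:
        url = (row.get(url_field) or "").strip()
        # Skip rows without a URL and rows whose URL was already kept.
        if not url or url in seen:
            continue
        seen.add(url)
        articles.append(row)
    return articles


# Made-up sample rows, standing in for an exported file.
sample = """Title_URL,Title,Published_Date
https://example.com/a,First headline,2022-01-01
https://example.com/b,Second headline,2022-01-02
https://example.com/a,First headline,2022-01-01
"""

for article in load_articles(sample):
    print(article["Published_Date"], article["Title"])
```

To run this on a real export, read the file's contents first, for example with `open("export.csv", encoding="utf-8").read()`, where `export.csv` is a placeholder file name.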

Note: Local runs work well for quick runs and small amounts of data. For more complicated tasks or large volumes of data, Run in the Cloud is recommended for higher speed. You can try this premium feature by signing up for the 14-day free trial here. Tasks can be scheduled hourly, daily, or weekly, with the data delivered regularly.
