You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend you upgrade, because the latest version is faster, easier to use, and more robust! Download and upgrade here if you haven't already done so!
As one of the most popular news websites in the U.S., The Washington Post covers news not only from America but from around the world. The site carries news in almost every field, including politics, opinion, coronavirus, and sports.
In this tutorial, we will use Octoparse to scrape data such as the article URL, news title, and published date for Covid-related news posted on The Washington Post.
Target URL used below:
The main steps are shown in the menu on the right. [Download task file here]
1. Create a Go to Web Page - to open the target page
To start scraping, input the target URL first.
Input the web page URL in the search box at the center of the home screen
Click Start to create a new task with Advanced Mode
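Behind the scenes, "Go to Web Page" amounts to requesting the target URL the way a browser would. As a rough illustration outside Octoparse, here is a minimal Python sketch that builds such a request; the URL below is a hypothetical placeholder, not the actual target URL from this tutorial, and the User-Agent value is just an example.

```python
from urllib.request import Request

# Hypothetical placeholder URL; substitute the real page you want to scrape.
TARGET_URL = "https://www.washingtonpost.com/coronavirus/"

# Build the request; a browser-like User-Agent header helps some sites
# serve the same page a normal browser would see.
req = Request(TARGET_URL, headers={"User-Agent": "Mozilla/5.0"})

print(req.full_url)                   # the page the task will open
print(req.get_header("User-agent"))   # the header sent along with it
```

In Octoparse itself, none of this is needed: entering the URL and clicking Start performs the equivalent step for you.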
2. Auto-detect web page data - to create a workflow
Octoparse's auto-detect function can identify the page structure and help to create a workflow quickly.
Click on Auto-detect web page data to start the detection automatically and wait for it to complete
Once the auto-detection is done, click Create workflow to generate a workflow
The automatically generated workflow for this task will appear as below:
Check the data fields in the Data Preview and delete unwanted fields or rename them if needed
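Conceptually, what auto-detection does is find the repeated item blocks on the page and pull the same fields out of each one. The following stdlib-only Python sketch mimics that idea on a made-up HTML fragment; the tag and class names (`story`, `date`) are illustrative and are not the Washington Post's real markup.

```python
from html.parser import HTMLParser

# Made-up fragment standing in for a results page; markup is illustrative only.
SAMPLE_HTML = """
<div class="story"><a href="/covid-1">Covid cases rise</a><span class="date">May 3, 2021</span></div>
<div class="story"><a href="/covid-2">Vaccine rollout expands</a><span class="date">May 4, 2021</span></div>
"""

class StoryParser(HTMLParser):
    """Collect one {url, title, date} row per repeated .story block."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._field = None   # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href, self._field = attrs.get("href"), "title"
        elif tag == "span" and attrs.get("class") == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._field == "title":
            self.rows.append({"url": self._href, "title": data, "date": None})
        elif self._field == "date" and self.rows:
            self.rows[-1]["date"] = data
        self._field = None

parser = StoryParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

Octoparse's Data Preview is the GUI equivalent of inspecting `parser.rows` here: the same fields extracted from each repeated block, ready to be renamed or deleted.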
3. Set up Pagination Loop - to scrape more results from multiple pages
To scrape results beyond the first batch, a pagination action is needed to load more of them.
Click the Load more results button at the bottom of the page first
Click Loop click on the Tips panel to generate pagination
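The Loop click built here keeps "clicking" Load more results until the site has nothing further to load. The control flow can be sketched in Python as below; `fetch_batch` is a hypothetical stub standing in for one rendered batch of results, not a real Octoparse or Washington Post API.

```python
# Hypothetical stub: each "click" of Load more results yields one more batch.
def fetch_batch(page):
    """Return the rows revealed by the page-th click, or [] when exhausted."""
    batches = {
        1: [{"title": "Covid update A"}],
        2: [{"title": "Covid update B"}],
    }
    return batches.get(page, [])

def scrape_all(max_pages=10):
    """Keep loading batches until one comes back empty or the cap is hit."""
    rows, page = [], 1
    while page <= max_pages:
        batch = fetch_batch(page)
        if not batch:        # nothing more to load: stop clicking
            break
        rows.extend(batch)
        page += 1
    return rows

print(scrape_all())
```

The `max_pages` cap mirrors the loop limits you can set in Octoparse to keep a run from clicking indefinitely.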
4. Run the task - to get the wanted data
The final workflow will look as below:
Click the Save button first to save all the settings you have made
Click Run next to it and wait for a Run Task window to pop up
Here is the sample output data, which can be exported in Excel, CSV, HTML, and JSON formats.
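For a sense of what two of those export formats look like, here is a short stdlib-only Python sketch that serializes some illustrative rows (mirroring the fields scraped in this tutorial) to CSV and JSON. The row values are examples, not real scraped output.

```python
import csv
import io
import json

# Illustrative rows with the same fields this tutorial scrapes.
rows = [
    {"url": "/covid-1", "title": "Covid cases rise", "date": "May 3, 2021"},
    {"url": "/covid-2", "title": "Vaccine rollout expands", "date": "May 4, 2021"},
]

# CSV export: one header line, then one line per scraped item.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url", "title", "date"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON export: the same rows as an array of objects.
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```

Octoparse generates these files for you from the export dialog; the sketch only shows the shape of the data you get back.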
Note: Local runs are great for quick runs and small amounts of data. If you are dealing with more complicated tasks or large amounts of data, Run in the Cloud is recommended for higher speed. You are very welcome to try this premium feature by signing up for the 14-day free trial here. Tasks can be scheduled hourly, daily, or weekly, and data delivered regularly.