This is the last lesson of the intro series. We hope you've had fun learning something new and useful. To place all the puzzle pieces together, let's have a recap with a step-by-step tutorial on how to build a scraping task from scratch. We'll walk you through the entire process from entering the URL to downloading the extracted data. Let's dive right into it.
For this example, we'll scrape article information from a sample article listing URL.
1. Start a new task
Enter the target URL into the search bar, then click Start to create a new task.
2. Start the Auto-detect
As soon as the webpage is loaded in the built-in browser, select Auto-detect web page data from the Tips panel. Octoparse will start detecting web page data right away.
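Aside: no code is needed for any of this, but if you're curious, list auto-detection essentially boils down to finding repeated elements on the page. Here is a minimal Python sketch of that idea using requests and BeautifulSoup; the URL and CSS selectors are hypothetical placeholders, not anything Octoparse exposes.

```python
# Minimal sketch of the auto-detect idea: find repeated list items.
# The URL and CSS selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles")  # sample listing page
soup = BeautifulSoup(resp.text, "html.parser")

# Assume each article sits in an <article class="post"> block.
for item in soup.select("article.post"):
    link = item.select_one("h2 a")
    print(link.get_text(strip=True), link["href"])
```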
3. Preview your data
Once the auto-detect process is complete, check your data in the Data preview section. Click the trash icon to remove any fields you don't need.
4. Save auto-detect settings
Go back to the Tips panel and check the settings below:
Check the Add a page scroll box if your target website loads more items as the page scrolls.
Check the Paginate to scrape more pages box if you'd like to scrape more than one page (a code sketch of this idea follows below).
Check that the correct pagination button has been selected (highlighted) on the website.
Now, click Create workflow and Octoparse will auto-generate the workflow.
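For reference, here is roughly what the pagination option corresponds to in plain Python: keep following the "next" button until there isn't one. The URL and the a.next selector are assumptions for illustration.

```python
# Sketch of "Paginate to scrape more pages": follow the "next" button
# until it disappears. The URL and the a.next selector are assumptions.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/articles"  # hypothetical listing page
while url:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for item in soup.select("article.post"):
        print(item.select_one("h2 a").get_text(strip=True))
    next_link = soup.select_one("a.next")  # the highlighted pagination button
    url = urljoin(url, next_link["href"]) if next_link else None
```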
Apart from the listing page, if you also want to scrape data from each article's detail page, follow the steps below:
Click Select subpage URL.
Choose the option Click on an extracted data field.
Select Title_URL from the dropdown menu and click Confirm.
Notice that an extra step, Click URL in the list, has been added to the workflow.
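In code terms, Click URL in the list corresponds to visiting each extracted link in turn. A hedged sketch of that idea, with hypothetical selectors:

```python
# Sketch of "Click URL in the list": fetch each extracted Title_URL
# and hand the detail page off for field extraction. The selectors
# are hypothetical examples, not Octoparse internals.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://example.com/articles"  # hypothetical listing page
soup = BeautifulSoup(requests.get(base).text, "html.parser")
detail_urls = [urljoin(base, a["href"]) for a in soup.select("article.post h2 a")]

for url in detail_urls:
    detail_page = BeautifulSoup(requests.get(url).text, "html.parser")
    # ...extract the title, author, etc. from detail_page here...
```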
5. Select data from the subpage
You will now arrive on the article detail page. Once again, select Auto-detect web page data from the Tips panel.
Octoparse can automatically detect fields such as the article title, content, and author.
TIP: The auto-detection process will start automatically. You can switch between the detected results until you have the right data selected.
Click Create workflow and Octoparse will update the workflow with the new subpage steps.
You can also manually select the information on the web page to scrape data if the auto-detection does not work well on the subpage.
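Manual selection has a direct code analogue too: instead of relying on detection, you name each field's selector explicitly. All the selectors in this sketch are hypothetical.

```python
# Manual field selection, code-style: explicit CSS selectors per field.
# Every selector here is a hypothetical example.
from bs4 import BeautifulSoup

def parse_detail(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.select_one("h1.article-title").get_text(strip=True),
        "author": soup.select_one("span.author").get_text(strip=True),
        "published": soup.select_one("time")["datetime"],
    }
```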
6. Clean the extracted data
Looking at the extracted data, there's something we'd like to change: the publishing date should be reformatted as yyyy-mm-dd, which we can do with Clean Data.
Click the More icon in the top-right corner and select Clean data.
Click Add step - Reformat date & time.
Select your target format.
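If you're wondering what that step does behind the scenes, date reformatting is just parse-then-format. A minimal Python equivalent, assuming the source dates look like "March 5, 2023":

```python
# Plain-Python equivalent of "Reformat date & time": parse the source
# format, then re-emit it as yyyy-mm-dd. The input format is an assumption.
from datetime import datetime

raw = "March 5, 2023"                       # example publishing date
parsed = datetime.strptime(raw, "%B %d, %Y")
print(parsed.strftime("%Y-%m-%d"))          # -> 2023-03-05
```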
7. Test-run the task
The scraping task is now complete. As mentioned before, it's always a good idea to test the workflow step by step, making sure each step does what it should. For example, clicking Go to Web Page should load the web page in the built-in browser without a problem.
Launch the workflow and test-run it by clicking through all the steps from top to bottom, and from the inside out for nested steps (such as pagination and Loop Item). Check that the web page responds as expected.
8. Schedule and run
Now that your task is tested and working, start with a local run to confirm that data can be scraped.
You can extract data much faster by running the task in the Cloud, and you can schedule it to run on a recurring basis.
To start a cloud run, click Standard Mode or Boost Mode under Run in the Cloud.
To schedule the task, click Automation Settings and click Edit.
Pick your desired frequency and designate a day and time for the run.
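Octoparse handles the scheduling for you, but for comparison, the same recurring-run idea in a self-hosted script might use the third-party schedule package; run_my_task below is a placeholder for your own scraper.

```python
# Recurring runs, script-style, via the third-party `schedule` package
# (pip install schedule). run_my_task is a hypothetical placeholder.
import time
import schedule

def run_my_task():
    print("scraping...")  # call your scraper here

schedule.every().monday.at("09:00").do(run_my_task)  # weekly on Mondays

while True:
    schedule.run_pending()
    time.sleep(60)
```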
9. Export your data
Go to the Task List to find your task and click the task status to view the extracted data. Click Export Data at the bottom and choose the format in which you'd like to download the data.
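Export, in code terms, is just serializing your rows; a minimal CSV sketch with hypothetical field names:

```python
# Minimal CSV export sketch. The rows and field names are hypothetical.
import csv

rows = [
    {"title": "Sample article", "author": "Jane Doe", "published": "2023-03-05"},
]

with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "published"])
    writer.writeheader()
    writer.writerows(rows)
```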
Congrats! You've done a great job making it this far and working your way toward becoming the next web scraping expert. We hope this is not the end of your learning but the beginning of your web scraping journey.