Now that you have downloaded Octoparse on your device and learned the basics, it is time to start your own web scraping project!
Many websites (directories, e-commerce sites, real estate sites, etc.) share a similar layout, e.g., a page containing many items arranged in a list. Let's look at a few examples.
Bestbuy.com
Amazon.com
Octoparse's brand-new auto-detect algorithm is designed specifically to scrape pages of this kind. It detects listing data (including text elements and links), "Next page" buttons, and "load more" buttons, works out whether the page needs to be scrolled, and then generates the scraping task automatically.
In this lesson, we will go through how to scrape webpage data by using the auto-detect algorithm.
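To make the idea of "detecting listing data" more concrete, here is a minimal Python sketch of one common heuristic: look for the parent element whose children repeat the same tag and class most often. This is an illustration only and is not Octoparse's actual algorithm, which is not described in this lesson. The names RepeatedItemFinder, find_list_items, and SAMPLE are hypothetical and exist only for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class RepeatedItemFinder(HTMLParser):
    """Counts, for every parent element, how often each child signature repeats."""

    def __init__(self):
        super().__init__()
        self.stack = [("root", 0)]   # (tag, node id) path from the root to the current element
        self.child_counts = {}       # parent node id -> Counter of (tag, class) signatures
        self.next_id = 1

    def handle_starttag(self, tag, attrs):
        signature = (tag, dict(attrs).get("class") or "")
        parent_id = self.stack[-1][1]
        self.child_counts.setdefault(parent_id, Counter())[signature] += 1
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating slightly sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_list_items(html):
    """Return ((tag, class), count) for the most frequently repeated sibling."""
    finder = RepeatedItemFinder()
    finder.feed(html)
    best = None
    for counts in finder.child_counts.values():
        for signature, count in counts.items():
            if best is None or count > best[1]:
                best = (signature, count)
    return best


# Hypothetical listing page used only for this illustration.
SAMPLE = """
<div class="header"><a href="/">Home</a></div>
<ul class="products">
  <li class="item"><a href="/p/1">Phone</a><span class="price">$199</span></li>
  <li class="item"><a href="/p/2">Laptop</a><span class="price">$899</span></li>
  <li class="item"><a href="/p/3">Camera</a><span class="price">$349</span></li>
</ul>
<a class="next" href="/page/2">Next</a>
"""

print(find_list_items(SAMPLE))  # (('li', 'item'), 3)
```

On the sample page, the three li elements with class "item" are the most repeated siblings, so they are reported as the likely list. A real detector would also need to handle nested lists, dynamically loaded content, and items without class names.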
STEP 1. Create a new task
Enter the sample URL (https://demo.octoparse.com/) into the search box at the top of the home screen. Click Start to create a new task.
STEP 2. Get data via auto-detect
Octoparse will load the URL in the built-in browser and start the auto-detect process automatically. Wait until the process is complete and the Tips panel shows further instructions.
Note:
If the data you need is not visible when the page first loads, you can interact with the web page (for example, by clicking or scrolling) before running auto-detect.
If this is your first time using Octoparse, the auto-detect feature will be enabled by default to help streamline the setup. Once you navigate to the target website, you'll notice that Octoparse automatically starts detecting the page. If you don't need this feature, you can disable it in Settings and click the Auto-Detect Web Page Data button to start the auto-detection manually.
STEP 3. Check the data
Once the auto-detection is completed, follow the instructions on the Tips panel and check your data in the Data Preview section. You can remove any data fields that you do not need. The detected data is also highlighted on the webpage.
STEP 4. Confirm your options
Now, go to the Tips panel and check your options. Octoparse offers different options depending on the type of data detected. In this example, list data is detected, so you are given the following options:
Extract the data in the list - This option is selected by default because Octoparse assumes it is what you want.
Paginate to scrape more pages - Octoparse has detected a "Next" button on the page. Check this option if you want Octoparse to click the "Next" button to extract data from more pages.
Add page scrolls - Check this option if a page scroll action is needed to load additional page content. In this example, uncheck it.
Note: To verify that the detected button is the correct one, click Check and see whether it is highlighted on the webpage. If you need to re-select the "Next" button, click Edit and follow the instructions on the Tips panel.
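The pagination option above amounts to a simple loop: extract the list items on the current page, follow the "Next" link, and stop when no such link remains. The Python sketch below illustrates that loop under stated assumptions. It is not how Octoparse implements pagination; PageParser, scrape_all_pages, and FAKE_SITE are hypothetical names, and the pages are served from an in-memory dictionary instead of the network so the example runs on its own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the text of every <li> and the href of a link labelled 'Next'."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_href = None
        self._li_text = None   # text buffer while inside an <li>
        self._a_href = None    # href of the <a> that is currently open
        self._a_text = None    # text buffer while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_text = []
        elif tag == "a":
            self._a_href = dict(attrs).get("href")
            self._a_text = []

    def handle_data(self, data):
        if self._li_text is not None:
            self._li_text.append(data)
        if self._a_text is not None:
            self._a_text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._li_text is not None:
            self.items.append("".join(self._li_text).strip())
            self._li_text = None
        elif tag == "a" and self._a_text is not None:
            label = "".join(self._a_text).strip().lower()
            if label == "next" and self._a_href:
                self.next_href = self._a_href
            self._a_text = None


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow 'Next' links from start_url, collecting list items from each page."""
    rows, url, seen = [], start_url, set()
    # Stop on a missing link, a repeated URL (loop guard), or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = PageParser()
        parser.feed(fetch(url))
        rows.extend(parser.items)
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return rows


# Hypothetical two-page site; the second page has no "Next" link.
FAKE_SITE = {
    "https://example.test/list":
        '<ul><li>Item A</li><li>Item B</li></ul><a href="/list?page=2">Next</a>',
    "https://example.test/list?page=2":
        "<ul><li>Item C</li></ul>",
}

rows = scrape_all_pages("https://example.test/list", FAKE_SITE.__getitem__)
print(rows)  # ['Item A', 'Item B', 'Item C']
```

The fetch argument is any function that returns the HTML for a URL, so the same loop could be pointed at a real HTTP client. The visited-URL set and max_pages limit guard against sites whose "Next" link loops back to an earlier page.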
STEP 5. Create workflow
After confirming the settings, click Create workflow.
Octoparse generates a workflow automatically based on the detected data and the settings you saved. You can then run the task right away or edit the workflow manually.
Continue to >> Lesson 2: Optimize your task