Building a scraping task in Octoparse always starts by loading one or more webpage URLs in the built-in browser, where you then build the scraping workflow. This step is called Go to Webpage.
Note that starting a task with direct page URLs, such as product page URLs, is more efficient than starting with a website's general domain URL, since the workflow can skip the navigation steps otherwise needed to reach those pages.
So, let's get started if you have the target URLs handy!
1. Open Single Web Page
There are several ways to open a webpage in Octoparse's built-in browser.
Let's say we'd like to scrape the following webpage on eBay:
1.1 Open a Webpage on the Home Page
On the Home page, there's a search bar. You can use it to search for relevant scraping templates or, when you enter a specific webpage URL, to load that page for a new task.
Copy and paste the target page URL into the search bar and then click Start. A new task will be generated automatically.
1.2 Open a Webpage Using the Custom Task
1.3 Open a Webpage by Adding a Step in the Workflow
A Go to Webpage step can always be added to the workflow directly. It can be the first step or sit anywhere else in the workflow, depending on when you need to open a webpage URL.
Then go to the Settings section, paste the URL into the URL field, and click Apply to save the change.
2. Open Multiple Webpage URLs
You don't always have to start with a single webpage URL. Instead, you can start with many webpage URLs when they share a similar page structure, such as the ones below.
A common use case is scraping product information from an e-commerce website. You can first build a scraping task to fetch the product page URLs, then use the scraped URLs in a second task to capture detailed product information. Adding these URLs all at once makes the scraping process more efficient.
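Before feeding the URLs from the first task into the second one, it can help to tidy the list. The snippet below is a minimal sketch that runs outside Octoparse, assuming the scraped URLs are already available as a list of strings. The function name `clean_url_list` and the example.com addresses are hypothetical and only serve as illustration.

```python
from urllib.parse import urlsplit


def clean_url_list(raw_urls):
    """Return absolute http(s) URLs in their original order, without duplicates."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = raw.strip()
        parts = urlsplit(url)
        # Keep only absolute web URLs; skip blank lines and relative paths.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    # Hypothetical URLs standing in for the output of the first task.
    scraped = [
        "https://www.example.com/item/1",
        "https://www.example.com/item/1 ",
        "/item/2",
        "",
        "https://www.example.com/item/3",
    ]
    # One URL per line, ready to paste into a multi-URL task.
    print("\n".join(clean_url_list(scraped)))
```

Removing duplicates and relative links up front keeps the second task from opening the same page twice or failing on an entry that is not a full URL.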
2.1 Open Multiple Webpages on the Home Page
A Loop URLs action would be generated automatically in the workflow. You can click the edit button to edit the list of URLs in the loop item.
Choose how you'd like to input the URLs.
You can enter the URLs manually, import them from a file such as an XLS file, import them from another task, or batch generate a list of URLs. Check Batch URL input for more detailed instructions.
If you choose to enter the URLs manually, you can copy the list directly from an Excel sheet and paste it into the URL Input. If you type the URLs instead, make sure you enter one URL per line.
Click Save
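If the target URLs follow a predictable pattern, such as a page number in the query string, the list can also be prepared outside Octoparse and pasted in manually. The following is a small sketch under that assumption; the helper `generate_page_urls`, the `{page}` placeholder, and the example.com URL are hypothetical and are not part of Octoparse's own batch generator.

```python
def generate_page_urls(template, start, stop, step=1):
    """Fill the {page} placeholder with each number from start to stop, inclusive."""
    return [template.format(page=n) for n in range(start, stop + 1, step)]


if __name__ == "__main__":
    # Hypothetical listing URL; replace it with the pattern your target site uses.
    urls = generate_page_urls(
        "https://www.example.com/search?q=laptop&page={page}", 1, 5
    )
    # One URL per line, matching the format expected for manual input.
    print("\n".join(urls))
```

The printed output can be copied straight into the URL Input, since each URL sits on its own line.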
2.2 Open Multiple Webpages Using the Custom Task
Choose how you'd like to input the URLs.
Copy and paste the URLs into the URL Input and click Save to start.
2.3 Open Multiple Webpages by Clicking the 'Edit URLs' button
If you start your task with a single URL and want to add more later, just click the Edit URLs button.
3. Settings for "Go to Webpage"
Every website is different and no two networks are the same. This is why you'd always want to make use of the settings for the "Go to Webpage" step to make sure any special situation is accommodated properly.
3.1 General
URL: Change the page URL here if you need to open a different webpage URL
Load URLs in the loop: Option for opening multiple webpages
Timeout: Increase the timeout if the webpage takes longer than usual to load
3.2 Options
Before action is performed: Options to set a waiting period before the action is executed.
Wait before action: Set up a waiting time to ensure the webpage is fully loaded before the action is executed
Wait until a designated element appears: Instruct Octoparse not to execute the action until a designated element appears.
After page is loaded: Options for what can be done after the webpage is loaded
Use cookies: Use cookies to open the webpage (such as when login is required)
Scroll down the page after it is loaded: Have Octoparse automatically scroll down the page once it has loaded so that more content is loaded.
3.3 Retry
Retry the action when: Use Retry to reload the page based on a set of pre-defined conditions, for example, if the current page does/does not contain a designated page element.
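To make the idea behind a retry condition concrete, here is a conceptual sketch of "reload until the page contains a designated element". It is not Octoparse's implementation; the function `load_with_retry`, its parameters, and the stand-in `fetch` callable are assumptions chosen for illustration.

```python
import time


def load_with_retry(fetch, url, marker, max_retries=3, wait_seconds=0.0):
    """Call fetch(url) until the returned HTML contains marker or retries run out.

    Returns (html, attempts). html is None if the marker never appeared.
    """
    total_attempts = max_retries + 1  # one initial load plus the retries
    for attempt in range(1, total_attempts + 1):
        html = fetch(url)
        if marker in html:
            return html, attempt
        if attempt < total_attempts:
            time.sleep(wait_seconds)  # pause before reloading the page
    return None, total_attempts


if __name__ == "__main__":
    # Simulated page loads: two incomplete responses, then the full page.
    responses = iter(["", "<html></html>", '<div id="price">9.99</div>'])
    html, attempts = load_with_retry(
        lambda url: next(responses), "https://www.example.com/item/1", 'id="price"'
    )
    print(attempts)  # 3
```

The same pattern covers the opposite condition (retry while an unwanted element is present) by inverting the check on `marker`.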
4. Dealing with Webpage Not Loading
Sometimes a webpage does not load properly in Octoparse's built-in browser, and you may only get a blank page. In this case, try switching to another user agent to see whether the page loads.
Go to Run Options and find the Browser User Agent. Select a different User Agent from the drop-down menu.
To check whether the new user agent (UA) works, refresh the page by clicking the "Reload webpage" icon and see if the webpage loads successfully.
There are many user agents to choose from, so you may need to experiment a bit to find one that works for your target webpage.
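For readers curious about what a user agent switch does under the hood, the sketch below sends the same request with different User-Agent headers using Python's standard library and reports the first one that returns a non-empty page. It runs independently of Octoparse. The user agent strings, the dictionary keys, and the helper names are illustrative assumptions and are not taken from Octoparse's drop-down menu.

```python
import urllib.request

# Example user agent strings (assumptions for illustration only).
USER_AGENTS = {
    "desktop-chrome": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "desktop-firefox": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
        "Gecko/20100101 Firefox/121.0"
    ),
}


def build_request(url, user_agent):
    """Create a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def first_working_agent(url, agents, opener=urllib.request.urlopen, timeout=10):
    """Return the name of the first agent that yields a non-empty page, else None."""
    for name, user_agent in agents.items():
        try:
            with opener(build_request(url, user_agent), timeout=timeout) as response:
                if response.read().strip():
                    return name
        except OSError:
            # URLError and timeouts are OSError subclasses; try the next agent.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical target; replace it with the page that fails to load.
    print(first_working_agent("https://www.example.com/", USER_AGENTS))
```

A server may respond differently depending on this header, which is why switching the user agent can turn a blank page into a fully loaded one.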