Open Page (Go to Webpage)

Follow these steps to load a web page in Octoparse.

Updated this week

Building a scraping task in Octoparse always starts by loading one or more webpage URLs in the built-in browser, where you then build the scraping workflow. The step that does this is called Go to Webpage.

Starting from direct page URLs, such as product URLs, is more efficient than starting the job from a website's general domain URL.

So, let's get started if you have the target URLs handy!


1. Open Single Web Page

There is more than one way to tell Octoparse to open a webpage in the built-in browser.

Let's say we'd like to scrape the following webpage on eBay:

1.1 Open a Webpage on the Home Page

On the Home page, there's a search bar. You can use it to search for relevant scraping templates or, if you enter a specific webpage URL, to load that page for a new task.

  • Copy and paste the target page URL into the search bar and then click Start. A new task will be generated automatically.

1.2 Open a Webpage Using the Custom Task

  • Find "What's a custom task" on the Home page and click + New Task

  • Paste the URL into the URL input and click Save to start

  • A Go to Webpage action is generated automatically in the workflow

1.3 Open a Webpage by Adding a Step in the Workflow

A Go to Webpage step can always be added to the workflow directly. It can be the first step or sit anywhere in the workflow, depending on when you need to open a webpage URL.

  • Move your cursor over the workflow and click the + button (Add Step) when it appears

  • Select Open Page from the menu

  • Then go to the Settings section, paste the URL into the URL field, and apply the change


2. Open Multiple Webpage URLs

You don't always have to start with a single webpage URL. Instead, you can start with many webpage URLs when they share a similar web structure, such as the ones below.

A common use case is scraping product information from an e-commerce website. You can first build a scraping task to fetch the product page URLs, then use the scraped URLs to build a second task that captures detailed product information. Adding these URLs all at once makes the scraping process more efficient.

2.1 Open Multiple Webpages on the Home Page

  • On the Home tab, copy and paste all URLs into the search bar and then click on Start

  • A Loop URLs action is generated automatically in the workflow. You can click the edit button to edit the list of URLs in the loop item.

  • Choose how you'd like to input the URLs.

    You can input the URLs manually, import the URLs from a file such as an XLS file, import the URLs from another task, or batch generate a list of URLs. Check Batch URL input for more detailed instructions.

If you choose to enter the URLs manually, you can copy the list of URLs directly from an Excel sheet and paste it into the URL Input. Otherwise, make sure you enter one URL per line.

  • Click Save
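If your URL list comes from several sources, it can help to tidy it before pasting it into the URL Input. The following is a minimal sketch in Python, run outside Octoparse; the function name `clean_url_list` and the example.com addresses are made up for illustration and are not part of Octoparse. It strips whitespace, drops blank or malformed lines, removes duplicates, and returns one URL per entry.

```python
from urllib.parse import urlparse


def clean_url_list(raw_text: str) -> list[str]:
    """Return the valid http(s) URLs in raw_text, one per entry, in first-seen order."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # skip anything that is not a full web address
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    # Hypothetical input: a pasted column with a blank line, a duplicate, and a stray note.
    sample = """
    https://www.example.com/item/1
    https://www.example.com/item/2

    https://www.example.com/item/1
    not a url
    """
    # Print one URL per line, ready to paste into the URL Input.
    print("\n".join(clean_url_list(sample)))
```

The printed result can be pasted as-is, since every URL ends up on its own line.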

2.2 Open Multiple Webpages Using the Custom Task

  • Find What's a custom task on the homepage and click + New Task

  • Choose how you'd like to input the URLs.

  • Copy and paste the URLs into the URL Input and click Save to start
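When the target pages differ only by a number, such as a page index, you may prefer to prepare the list yourself before pasting it. This sketch is not Octoparse's own batch generator; it is a plain Python illustration that assumes a hypothetical URL pattern with an `{n}` placeholder. The function name `batch_urls` and the example.com address are assumptions.

```python
def batch_urls(template: str, start: int, stop: int, step: int = 1) -> list[str]:
    """Fill the {n} placeholder in template with start, start + step, ... up to stop (inclusive)."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return [template.format(n=n) for n in range(start, stop + 1, step)]


if __name__ == "__main__":
    # Hypothetical listing pages 1 to 5 of an example site.
    for url in batch_urls("https://www.example.com/products?page={n}", 1, 5):
        print(url)  # one URL per line, ready to paste
```

Each generated URL is printed on its own line, which matches the one-URL-per-line format expected by the URL Input.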

2.3 Open Multiple Webpages by Clicking the 'Edit URLs' button

If you start your task with a single URL and want to add more later, just click the Edit URLs button.


3. Settings for "Go to Webpage"

Every website is different, and no two networks are the same, so use the settings for the "Go to Webpage" step to make sure any special situation is accommodated properly.

3.1 General

  • URL: Change the page URL here if you need to open a different webpage URL

  • Load URLs in the loop: Option for opening multiple webpages

  • Timeout: Increase the timeout if the webpage takes longer than usual to load

3.2 Options

  • Before action is performed: Options to set a waiting period before the action is executed.

    • Wait before action: Set up a waiting time to ensure the webpage is fully loaded before the action is executed

    • Wait until a designated element appears: Instruct Octoparse not to execute the action until a designated element appears.

  • After page is loaded: Options for what can be done after the webpage is loaded

    • Use cookies: Use cookies to open the webpage (such as when login is required)

    • Scroll down the page after it is loaded: Tell Octoparse to scroll down the page automatically as soon as it has loaded, so that more content is loaded.

3.3 Retry

  • Retry the action when: Use Retry to reload the page when a predefined condition is met, for example, when the current page does or does not contain a designated page element.
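To make the idea of a conditional retry concrete, here is a conceptual sketch in Python. It does not show how Octoparse implements Retry; `load_with_retry`, `max_retries`, and `interval_seconds` are hypothetical names chosen for illustration. The page is loaded once, then reloaded while a condition on its content holds, up to a fixed number of extra attempts.

```python
import time
from typing import Callable


def load_with_retry(
    load_page: Callable[[], str],
    should_retry: Callable[[str], bool],
    max_retries: int = 3,
    interval_seconds: float = 2.0,
) -> str:
    """Load a page, reloading while should_retry(page) is true, at most max_retries extra times."""
    page = load_page()
    for _ in range(max_retries):
        if not should_retry(page):
            break  # the condition no longer holds, so keep this page
        time.sleep(interval_seconds)  # pause between attempts
        page = load_page()
    return page
```

A condition such as `lambda page: "price" not in page` corresponds to "retry when the page does not contain a designated element". If the condition still holds after the last attempt, the sketch simply returns the last page it loaded.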


4. Dealing with Webpage Not Loading

Sometimes a webpage does not load properly in Octoparse's built-in browser, and you may get only a blank page. In this case, try switching to another user agent to see whether that helps.

  • Click Task Settings

  • Go to Run Options and find the Browser User Agent. Select a different User Agent from the drop-down menu.

  • Click Save to apply the new settings.

To check whether the new user agent works, refresh the page by clicking the "Reload webpage" icon and see whether the webpage loads successfully.

There are many user agents to choose from, so you may need to experiment a bit to find one that works for your target webpage.
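If you want a rough hint of whether a site responds differently to different user agents, you can also probe it outside Octoparse. The sketch below uses only Python's standard `urllib.request`; the function names and any user agent strings you pass in are your own choices, not Octoparse settings. Treat the result as indicative only, because a plain HTTP request does not render the page the way the built-in browser does.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that identifies itself with the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def response_size(url: str, user_agent: str, timeout: float = 15.0) -> int:
    """Return the number of bytes the server sends back for this user agent.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, which is
    itself a sign that the server rejected the request.
    """
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return len(resp.read())
```

Calling `response_size` with the same URL and two different user agent strings, then comparing the sizes, can suggest whether the server treats them differently. A much smaller or empty response for one of them is a reason to try another option in the Browser User Agent drop-down.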
