Open page

Building a scraping task in Octoparse always starts by loading one or more webpage URLs in the built-in browser, where you then build the scraping workflow. This step is called Go to Web Page.

Note that starting from direct web page URLs, such as product URLs, is more efficient than starting the job from a general website domain URL.

So, when you have the target URLs handy, let's get started!


1. Open Single Web Page

There is more than one way you can tell Octoparse to open a webpage in the built-in browser.

Let's say we'd like to scrape the following webpage on eBay:

1.1 Open a Webpage on the Home Page

On the Home page, there's a search bar. You can use it to search for relevant scraping templates or, by entering a specific webpage URL, to load a webpage for a new task.

  • Copy and paste the target page URL into the search bar then click Start. A new task will be generated automatically.

mceclip7.png

1.2 Open a Webpage Using the Side Navigation Menu

  • Click on the + New button on the sidebar menu then select Advanced Mode

mceclip0.png
  • Paste the URL into the Website box and Save to start

mceclip1.png
  • A Go to Web Page action is generated automatically in the workflow

mceclip2.png

1.3 Open a Webpage by Adding a Step in the Workflow

A "Go to Web Page" step can always be added to the workflow directly. It can be the first step of the workflow or placed anywhere in the workflow, depending on when you need to open a webpage URL.

  • Move your cursor to the workflow and click on the + button when it appears to add a step

mceclip3.png
  • Select Open Page from the menu

mceclip5.png
  • Then go to the Setting section and paste the URL into the URL field and Apply the change

mceclip6.png

2. Open Multiple Webpage URLs

You don't always have to start with a single webpage URL. Instead, you can start with many webpage URLs when they share a similar web structure, such as the ones below.

A common use case is scraping product information from an eCommerce website. You can first build a scraping task to fetch the product page URLs, then use the scraped URLs to build a second task that captures the specific product information. Adding these URLs all at once makes the scraping process more efficient.
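As a rough illustration of the hand-off between the two tasks, the sketch below pulls the product URLs out of the first task's exported data so they can be pasted into the second task. It assumes the export is a CSV file with a column named product_url. The column name, the sample data, and the function name are all hypothetical and are not defined by Octoparse.

```python
import csv
import io


def extract_urls(csv_text: str, column: str = "product_url") -> list[str]:
    """Return the unique, non-empty URLs found in one column of CSV text.

    `column` is a hypothetical header name; use whatever your export contains.
    """
    seen = set()
    urls = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get(column) or "").strip()
        # Keep only http(s) links and skip duplicates, preserving order.
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        "title,product_url\n"
        "Item A,https://example.com/item/1\n"
        "Item B,https://example.com/item/2\n"
        "Item A again,https://example.com/item/1\n"
    )
    # One URL per line, ready to paste into the URL list of the second task.
    print("\n".join(extract_urls(sample)))
```

Removing duplicates before the second task avoids opening the same product page twice.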

2.1 Open Multiple Webpages on the Home Page

  • On the Home tab, copy and paste all URLs into the search bar then click on Start

mceclip8.png
  • A Loop URLs action is generated automatically in the workflow. You can edit the list of URLs in the URL field of the Setting section when needed.

mceclip9.png

2.2 Open Multiple Webpages on the Side Navigation Menu

  • Click on the + New button on the sidebar menu then select Advanced Mode

mceclip0.png
  • Copy and paste the URLs into the Website box and Save to start

Choose how you'd like to input the URLs. You can input the URLs manually, import the URLs from a file such as an XLS file, import the URLs from another task, or batch generate a list of URLs. Check Batch URL input for more detailed instructions.

If you choose to enter the URLs manually, enter one URL per line, or copy the list of URLs directly from an Excel sheet.
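If you prepare the list outside Octoparse, a short script can build it in the one-URL-per-line shape described above. The sketch below is only an illustration and is not how Octoparse's own batch generation works. The {page} placeholder, the example.com addresses, and both function names are assumptions made for this example.

```python
def generate_urls(template: str, start: int, stop: int, step: int = 1) -> list[str]:
    """Fill the `{page}` placeholder with each number from start to stop inclusive."""
    return [template.format(page=n) for n in range(start, stop + 1, step)]


def one_url_per_line(raw: str) -> str:
    """Split pasted text on any whitespace and return one URL per line."""
    urls = [part for part in raw.split() if part.startswith(("http://", "https://"))]
    return "\n".join(urls)


if __name__ == "__main__":
    # Hypothetical paginated listing: pages 1 to 3.
    pages = generate_urls("https://example.com/list?page={page}", 1, 3)
    print("\n".join(pages))

    # URLs pasted on a single line are split into separate lines.
    print(one_url_per_line("https://example.com/a https://example.com/b"))
```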

56.png

2.3 Open Multiple Webpages by Adding a Step in the Workflow

  • If you need to add a list of URLs in the workflow, hover over where you'd like to add the steps and click the "+" icon.

57.png
  • Add a "Loop" item from the drop-down menu. When a Loop Item is added, double-click it to input the URLs.

58.png

Under the "Loop Item", select the loop mode as a list of URLs and click the edit button to input the URLs.

59.png

Save the settings and a "Loop Item" with "Go to Web Page" nested inside will be generated.

60.png

3. Settings for "Go to Web Page"

Every website is different and no two networks are the same. This is why you'd always want to make use of the settings for the "Go to Web Page" step to make sure any special situation is accommodated properly.

  • Timeout: Adjust "Timeout" if the web page takes longer than usual to load

  • URL: Change page URL here if you need to open a different webpage URL

  • Before page render: Options for what can be done before the page loads

    • You can set up wait time to slow down the process whenever needed

    • Use cookies to open the webpage (such as when log-in is required)

61.png
  • After loading page: Options for what can be done after the webpage has loaded

The most frequently used option is adding a page scroll-down. This is where you tell Octoparse to automatically scroll down the page to load more content as soon as the page has loaded.

First, tick the box for scroll down, and choose how you'd like to scroll the page, i.e., scroll "to the bottom of the page" or "for one screen".

Then, set up "Repeats", i.e., how many times you'd like to scroll down the page, and "Wait time", i.e., how long to wait between scrolls.

62.png
  • Retry: Use "Retry" to reload the page based on a set of pre-defined conditions, for example, if the current page does/does not contain a designated page element.

63.png
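The scroll and retry options above can be read as simple loops. The sketch below models that logic in plain Python to make the settings easier to reason about. It is not Octoparse's implementation. The names load_page, must_contain, and max_retries are hypothetical, and the time estimate assumes one wait per scroll.

```python
from typing import Callable


def scroll_wait_seconds(repeats: int, wait_seconds: float) -> float:
    """Rough estimate of time spent waiting, assuming one wait per scroll."""
    return repeats * wait_seconds


def load_with_retry(
    load_page: Callable[[], str],
    must_contain: str,
    max_retries: int = 3,
) -> tuple[str, int]:
    """Reload until the page text contains `must_contain` or retries run out.

    Returns the last page text and the number of reloads performed.
    """
    page = load_page()
    reloads = 0
    while must_contain not in page and reloads < max_retries:
        page = load_page()
        reloads += 1
    return page, reloads


if __name__ == "__main__":
    # Simulated page that only shows the wanted element on the third load.
    responses = iter(["", "<div>loading</div>", "<div id='price'>9.99</div>"])
    page, reloads = load_with_retry(lambda: next(responses), "id='price'")
    print(reloads, page)
    print(scroll_wait_seconds(repeats=5, wait_seconds=2))
```

Under that assumption, more repeats or a longer wait time add directly to how long each page takes, so it is worth keeping both as low as the page allows.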

4. Dealing with Web Page Not Loading

Sometimes a web page does not load well in Octoparse's built-in browser, and you may only get a blank page. In this case, try switching to another user agent (UA) and see if things get better.

  • Click on the setting icon

64.png
  • Go to "Browser Ver" under the "Run Settings". Choose a different UA from the drop-down menu.

65.png
  • When done, click "Save" to save the new settings.

66.png

To see if the new UA works, refresh the page by clicking the "Reload webpage" icon and check whether the webpage now loads successfully.

67.png

There are many user agents to choose from, so you may want to experiment a bit to find one that works for your target webpage.
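For background, a user agent is a string the browser sends with each request, and some websites respond differently depending on it. The sketch below shows the same trial-and-error idea outside Octoparse, using only Python's standard library. The function names and the fetch parameter are hypothetical, and any user agent strings you pass in are your own choice.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(request: urllib.request.Request) -> str:
    """Download the page; this performs a real network call."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def first_working_user_agent(
    url: str,
    user_agents: Iterable[str],
    fetch: Callable[[urllib.request.Request], str] = fetch_body,
) -> Optional[str]:
    """Return the first user agent whose response body is not blank."""
    for user_agent in user_agents:
        try:
            body = fetch(build_request(url, user_agent))
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses.
            continue
        if body.strip():
            return user_agent
    return None
```

Passing a different fetch function lets you try the logic without touching the network, for example with a stub that returns canned page text.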
