Content on web pages is usually organized in some kind of pattern, and one of the most common patterns is a list. Here are a few examples of content laid out as a list.
Scraping a list is quick and easy with Octoparse's Auto-detect feature. Octoparse can detect the items in a list and generate the task workflow automatically. Let's see how it is done with an example.
This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...
Our goal is to extract the data into Excel like this:
Now, let's explore different ways to get this done in Octoparse:
To follow along, use this example URL: http://test-sites.octoparse.com/?page_id=6
1. Extract a list with Auto-detect
Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will detect the data on the page, and you can then click "Create workflow" to generate the workflow.
After that, you can modify the fields in the Data Preview:
Delete unwanted fields
Rename the fields by double-clicking on the header
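Outside of Octoparse, the same clean-up (deleting unwanted fields and renaming headers) can be sketched in a few lines of Python. This is only an illustration: the rows, the placeholder field names ("Field1" and so on) and the new headers are made-up assumptions, not output from the example page.

```python
import csv
import io

# Hypothetical rows, roughly as they might look before any clean-up.
rows = [
    {"Field1": "Sample title A", "Field2": "2020-01-01", "Field3": "https://example.com/a.png"},
    {"Field1": "Sample title B", "Field2": "2020-01-02", "Field3": "https://example.com/b.png"},
]

unwanted = {"Field3"}                              # fields to delete
renames = {"Field1": "Title", "Field2": "Date"}    # old header -> new header


def tidy(row):
    """Drop unwanted fields and rename the remaining ones."""
    return {renames.get(key, key): value
            for key, value in row.items()
            if key not in unwanted}


cleaned = [tidy(row) for row in rows]

# Write the cleaned rows as CSV text, which Excel can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(cleaned[0].keys()))
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()
print(csv_text)
```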
2. Extract a list manually
If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.
Method 1:
Hover your cursor over the first item until the entire section gets highlighted in blue, and click on it
Then click on the second item. Octoparse will select all the similar items on the page.
Choose Text and Octoparse will create a Loop Item automatically
Now, all the data is scraped into one field. To create separate fields, select each piece of information, such as the title, date and keyword, on the web page.
Select the title and choose Text
Repeat the steps to get other information
Double-click on the field name to rename it if needed
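Conceptually, a Loop Item visits each list item in turn and pulls the same fields out of every one. The Python sketch below illustrates that idea only; it is not how Octoparse is implemented. The sample markup, the tag and class names, and the field paths are all assumptions made for the example and are not taken from the test page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every item shares the same structure.
SAMPLE = """
<div class="list">
  <div class="item">
    <h2 class="title">First article</h2>
    <span class="date">2020-01-01</span>
    <span class="keyword">alpha</span>
  </div>
  <div class="item">
    <h2 class="title">Second article</h2>
    <span class="date">2020-01-02</span>
    <span class="keyword">beta</span>
  </div>
</div>
"""

# Field name -> path relative to one item (assumed names).
FIELDS = {
    "Title": ".//h2[@class='title']",
    "Date": ".//span[@class='date']",
    "Keyword": ".//span[@class='keyword']",
}


def extract_rows(markup):
    """Loop over every item and extract one row of fields per item."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//div[@class='item']"):   # the "loop"
        row = {}
        for name, path in FIELDS.items():                # one field at a time
            node = item.find(path)
            row[name] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because each field is looked up relative to its own item, a missing element in one item yields an empty value for that row instead of shifting data between rows.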
Method 2:
Hover your cursor over the first item until the entire section gets highlighted in blue
You will notice that Octoparse detects sub-elements from the section and highlights them in red.
Choose Select all child elements
Choose Select all similar groups
Select Element data
A loop item will be generated automatically to scrape the list of items on the page.
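The idea behind selecting all child elements across all similar groups can be sketched as follows: instead of naming each field, take every direct child of each group as one column. This is a rough illustration in Python under assumed sample markup, not a description of Octoparse's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <li> is one "group" with the same child elements.
SAMPLE = """
<ul>
  <li><b>First article</b><i>2020-01-01</i><em>alpha</em></li>
  <li><b>Second article</b><i>2020-01-02</i><em>beta</em></li>
</ul>
"""


def extract_groups(markup):
    """Return one row per group, with one cell per direct child element."""
    root = ET.fromstring(markup)
    table = []
    for group in root:                                   # every similar group
        cells = [(child.text or "").strip() for child in group]  # all child elements
        table.append(cells)
    return table


table = extract_groups(SAMPLE)
for cells in table:
    print(cells)
```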
The final workflow should look like this: