Content on web pages is usually organized in some kind of pattern, and one of the most common patterns is a list. Here are a few examples of content laid out as a list.
Scraping a list is quick and easy with Octoparse's Auto-detect feature, which detects the items in a list and generates the task workflow automatically. Let's see how it is done with an example.
This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...
Our goal is to extract the data into Excel like this:
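As a rough picture of the target table, the sketch below writes one row per list item to CSV text, which Excel can open. The sample records and the column names (title, date, keyword) are assumptions chosen to match the fields mentioned above; they are not taken from the example page or from Octoparse's export.

```python
import csv
import io

# Hypothetical records: one dict per list item, using the fields named in the text.
rows = [
    {"title": "First post", "date": "2021-01-05", "keyword": "web scraping"},
    {"title": "Second post", "date": "2021-01-12", "keyword": "data extraction"},
]


def to_csv(records, fieldnames=("title", "date", "keyword")):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Each item becomes a row and each field becomes a column, which is the shape a list-scraping task is expected to produce.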
Now, let's explore different ways to get this done in Octoparse:
To follow along, use this example URL: https://www.octoparse.com/blog
1. Extract a list with Auto-detect
Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will detect the data on the page, and you can then click "Create workflow" to generate the workflow.
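This article does not describe how Auto-detect works internally, so the following is only an illustration of the general idea of spotting a list: look for the group of sibling elements that repeats most often with the same tag and class. The sample page, the class names and the SiblingCounter and detect_list helpers are all hypothetical, and the sketch assumes well-formed markup in which every element is explicitly closed. It is not Octoparse's algorithm.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical page: a two-link navigation bar plus a list of three posts.
SAMPLE_PAGE = """
<div class="nav"><a class="link">Home</a><a class="link">Blog</a></div>
<ul class="posts">
  <li class="post"><h2>First post</h2></li>
  <li class="post"><h2>Second post</h2></li>
  <li class="post"><h2>Third post</h2></li>
</ul>
"""


class SiblingCounter(HTMLParser):
    """Counts elements that share the same parent, tag and class."""

    def __init__(self):
        super().__init__()
        self.open_ids = []       # ids of the elements currently open
        self.counts = Counter()  # (parent id, tag, class) -> occurrences
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        parent = self.open_ids[-1] if self.open_ids else 0
        css_class = dict(attrs).get("class") or ""
        self.counts[(parent, tag, css_class)] += 1
        self._next_id += 1
        self.open_ids.append(self._next_id)

    def handle_endtag(self, tag):
        # Assumes every element is closed in order (well-formed markup).
        if self.open_ids:
            self.open_ids.pop()


def detect_list(html):
    """Return (tag, class, count) for the most repeated sibling group."""
    counter = SiblingCounter()
    counter.feed(html)
    if not counter.counts:
        return None
    (_, tag, css_class), count = counter.counts.most_common(1)[0]
    return tag, css_class, count


candidate = detect_list(SAMPLE_PAGE)
print(candidate)
```

On the sample page the three post entries outnumber the two navigation links, so they are picked as the list candidate. Real pages are messier, which is one reason manual selection remains useful as a fallback.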
2. Extract a list manually
If Auto-detect fails to detect the list, or if you are building a task without it, you can always extract the list manually.
Load the web page in Octoparse and hover your cursor over the first item until the entire section is highlighted in blue, then click it.
Click the second item as well. Octoparse will then select all the matching items on the page.
Choose "Extract text of the selected elements" and Octoparse will create a Loop Item automatically.
You will notice that the first item is now highlighted in red. You can select information such as the title, date and keyword from the highlighted area.
Select the title and choose "Extract the text of the element".
Repeat these steps to get the other information.
Double-click a field name to rename it if needed.
Alternatively, you can select the sub-elements of every item in one pass:
Hover your cursor over the first item until the entire section is highlighted in blue.
You will notice that Octoparse detects the sub-elements in the section and highlights them in red.
Choose "Select sub-elements".
Choose "Select all".
Select "Extract data". A Loop Item will be generated automatically to scrape the list of items on the page.
Tip: If you want to edit or delete the extracted data fields, you can click "Extract Data" and modify the fields on the Data Preview panel.
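For readers who think in code, the manual workflow above (loop over each item, then extract the same fields from it) can be sketched with Python's standard html.parser module. The HTML snippet, the class names post, title, date and keyword, and the ListItemParser helper are hypothetical and do not reflect the markup of the example page or the way Octoparse stores a Loop Item.

```python
from html.parser import HTMLParser

# Hypothetical markup: two list items that share the same structure.
SAMPLE_HTML = """
<ul>
  <li class="post"><h2 class="title">First post</h2><span class="date">2021-01-05</span><span class="keyword">web scraping</span></li>
  <li class="post"><h2 class="title">Second post</h2><span class="date">2021-01-12</span><span class="keyword">data extraction</span></li>
</ul>
"""

FIELDS = {"title", "date", "keyword"}


class ListItemParser(HTMLParser):
    """Collects one dict per list item, like a loop with field extraction."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # the item currently being filled
        self._field = None    # the field whose text is being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "post" in classes:
            self._current = {}  # a new loop item starts here
        elif self._current is not None:
            for name in classes:
                if name in FIELDS:
                    self._field = name

    def handle_data(self, data):
        if self._current is not None and self._field:
            text = self._current.get(self._field, "") + data.strip()
            self._current[self._field] = text

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.items.append(self._current)  # the loop item is complete
            self._current = None
        self._field = None


parser = ListItemParser()
parser.feed(SAMPLE_HTML)
items = parser.items
print(items)
```

The outer match on the item element plays the role of the Loop Item, and the per-field matches play the role of the extracted data fields. Renaming a field in Octoparse corresponds to changing a key name here.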