In this tutorial, we will show you how to extract text, link URLs, image URLs, HTML, and other attribute values.
1. Extract Text
Click on your target data, then select Extract text of the selected element from the tips panel.
2. Extract the URL (of a link or an image)
A URL is the web address that a hyperlink points to. Clicking a link opens the page at that address, just as clicking the title of a book on Amazon opens that book's page.
A URL can also point to a specific file on the Internet, such as an image or a PDF document. Once you have the URL, you can download the corresponding file from it.
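To illustrate downloading a file from an extracted URL outside Octoparse, here is a minimal Python sketch that uses only the standard library. The helper names `filename_from_url` and `download`, the default file name, and the example URL in the comment are assumptions made for illustration; they are not part of Octoparse.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Derive a local file name from the last path segment of a URL."""
    name = posixpath.basename(unquote(urlparse(url).path))
    return name or default


def download(url, target_dir="."):
    """Save the resource behind `url` into `target_dir` and return its path."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, filename_from_url(url))
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Hypothetical usage with an image URL exported from a scraping task:
# download("https://example.com/covers/dune.jpg", "downloads")
```

The query string is ignored when the file name is derived, so a URL ending in `cover.jpg?size=large` is saved as `cover.jpg`.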
2.1 Extract the URL of a link
Click on your target data, then select Extract the URL of the selected element from the tips panel.
TIP: When you select an item with a URL, the selected tag at the bottom of "Tips" should be "A", which stands for anchor, the HTML element that links one page to another. Please make sure you select the right area.
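To see why the selected tag should be "A", it helps to look at what a link looks like in HTML: the URL is stored in the `href` attribute of an `<a>` element. The following Python sketch, which uses only the standard library, collects the text and URL of each link in a small HTML snippet. The class name `LinkCollector` and the sample HTML are made up for illustration and do not describe how Octoparse works internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


sample_html = '<h2><a href="/books/dune">Dune</a></h2><p>Paperback</p>'
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)  # [('Dune', '/books/dune')]
```

The `<p>` element contributes nothing here because it is not an anchor, which is the same reason selecting the wrong area yields no URL.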
2.2 Extract the image URL
Click on your target data, then select Extract the URL of the selected image from the tips panel.
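In HTML, an image URL normally lives in the `src` attribute of an `<img>` element, and it is often written relative to the page. The sketch below shows one way to read those values and turn them into absolute URLs with Python's standard library. The class name `ImageCollector`, the page URL, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute image URLs from the src attribute of <img> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the URL of the page itself.
                self.image_urls.append(urljoin(self.page_url, src))


page_url = "https://example.com/books/dune"
sample_html = (
    '<img src="/covers/dune.jpg" alt="Dune cover">'
    '<img src="https://cdn.example.com/badges/new.png" alt="New">'
)
collector = ImageCollector(page_url)
collector.feed(sample_html)
print(collector.image_urls)
# ['https://example.com/covers/dune.jpg', 'https://cdn.example.com/badges/new.png']
```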
FAQ: Can I use Octoparse to directly get an image, not its URL, from the web page?
3. Extract the inner/outer HTML
Unlike text and URLs, visual content such as icons cannot be extracted directly. If you want to extract visual, non-text content like a star rating, you have to extract the inner or outer HTML of the element instead.
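As a rough illustration of the two terms: the outer HTML of an element includes the element's own tag, while the inner HTML is only the markup inside it. The Python sketch below shows the difference on a small, well-formed snippet using the standard library's `xml.etree.ElementTree`. The function names `outer_html` and `inner_html` and the sample markup are assumptions for this example, and the XML parser used here only accepts well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

snippet = (
    '<div class="review">'
    '<i class="star" title="4.5 out of 5"></i>'
    "<b>Great</b> read"
    "</div>"
)
element = ET.fromstring(snippet)


def outer_html(el):
    """Markup of the element including its own opening and closing tag.

    Intended for a root element; a nested element's trailing text would
    also be included by ET.tostring.
    """
    return ET.tostring(el, encoding="unicode", short_empty_elements=False)


def inner_html(el):
    """Markup between the element's opening and closing tag."""
    parts = [el.text or ""]
    for child in el:
        # ET.tostring also emits the text that follows each child (its tail).
        parts.append(ET.tostring(child, encoding="unicode", short_empty_elements=False))
    return "".join(parts)


print(outer_html(element))  # starts with <div class="review"> and ends with </div>
print(inner_html(element))  # the same markup without the surrounding <div>
```

In this sample the rating is only present in the `title` attribute of the icon, which is why extracting the text alone would miss it.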
Besides icons, you can also scrape hidden text, charts, and graphs from a web page by extracting the HTML of these elements first. After getting the HTML code, you need to apply regular expressions to pull the values you need out of the markup.
To extract the inner/outer HTML, click on your target data, then select Extract the inner/outer HTML of the selected element from the tips panel.
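As an example of the clean-up step, the Python sketch below applies two regular expressions to extracted HTML: one pulls a numeric star rating out of the markup, the other strips the tags to recover hidden text. The sample HTML, the "N out of M" wording, and the function names `rating_from_html` and `strip_tags` are assumptions; the patterns you need depend on the markup of the page you scrape.

```python
import re

# Matches phrases like "4.5 out of 5" anywhere in the markup.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?) out of (\d+)")
# Matches any HTML tag, e.g. <span ...> or </span>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def rating_from_html(html):
    """Return the rating as a float, or None if no rating phrase is found."""
    match = RATING_PATTERN.search(html)
    return float(match.group(1)) if match else None


def strip_tags(html):
    """Remove all tags and return the remaining text, trimmed."""
    return TAG_PATTERN.sub("", html).strip()


star_html = '<i class="star star-4-5" title="4.5 out of 5 stars"></i>'
hidden_html = '<span style="display:none">In stock: 12</span>'

print(rating_from_html(star_html))  # 4.5
print(strip_tags(hidden_html))      # In stock: 12
```

Stripping tags with a regular expression is adequate for short extracted fragments like these, but it is not a general HTML parser.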
TIP: To refine the extracted inner/outer HTML into useful data, you might want to check out these tutorials -
4. Extract attribute value
Attributes sit inside the opening tag of an HTML element and provide additional information about that element. They usually come in name/value pairs like name="value". For example, the star rating is often stored in an attribute. Octoparse can scrape the attribute value directly.
Click on the target element (here we take the star rating as an example) and select Extract the text or HTML of the element.
Go to the Data Preview section, hover over the field name, click the ... (more) button, select Customize field, then choose your target attribute under Extract attribute.
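To make the name/value pairs concrete, the Python sketch below lists every attribute of an element and then reads one of them, which is roughly what choosing a target attribute amounts to. The class name `AttributeReader`, the sample element, and the `data-rating` attribute are hypothetical.

```python
from html.parser import HTMLParser


class AttributeReader(HTMLParser):
    """Record the attributes of the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.attributes is None:
            # attrs is a list of (name, value) pairs taken from the opening tag.
            self.attributes = dict(attrs)


star_html = '<i class="star star-4-5" title="4.5 out of 5 stars" data-rating="4.5"></i>'
reader = AttributeReader("i")
reader.feed(star_html)

print(reader.attributes)
# {'class': 'star star-4-5', 'title': '4.5 out of 5 stars', 'data-rating': '4.5'}
print(reader.attributes["data-rating"])  # 4.5
```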