Many product web pages use image carousels (like the one below) to display multiple images as slides which you can usually flip through manually. In this tutorial, I will show you how to extract the images of a carousel into your desired format.
This tutorial uses the link below as an example, and the same steps apply to most carousels:
Format 1. One image URL per column
Example output:
Select one of the images, then select Image URL on the Tips Panel. Repeat the same process for each of the other images to fetch their URLs.
NOTE: On this example page, you need to select the IMG tag at the bottom of the Tips Panel to locate the image URL. Octoparse shows the Image URL option only when the IMG tag is selected.
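For readers who want to see the target shape outside Octoparse, the following is a minimal Python sketch of the one-URL-per-column layout. The URLs and the column names (image_1, image_2, ...) are placeholders invented for illustration; they do not come from the example page or from Octoparse itself.

```python
import csv
import io

# Hypothetical image URLs standing in for the slides of a carousel.
image_urls = [
    "https://example.com/img/a.jpg",
    "https://example.com/img/b.jpg",
    "https://example.com/img/c.jpg",
]


def to_wide_row(urls):
    """Return (header, row) with one column per image URL."""
    header = [f"image_{i}" for i in range(1, len(urls) + 1)]
    return header, list(urls)


header, row = to_wide_row(image_urls)

# Write a single data line in which every URL has its own column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
wide_csv = buffer.getvalue()
print(wide_csv)
```

This layout works best when every product has the same number of images; otherwise some columns stay empty on some rows.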
Format 2. One image URL per data line
Example output:
It is also possible to scrape images to different lines of the same column using a loop extract action.
Step 1. Click on the first image in the carousel
Step 2. On the Tips Panel, select the IMG tag, then choose Select all similar elements
Step 3. Select Image URL
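The steps above amount to "find every similar IMG element and read its URL, one per line". As a rough illustration of that idea, and not of how Octoparse works internally, here is a small Python sketch using only the standard library. The carousel markup, the class name ImgSrcCollector, and the image_url field name are all assumptions made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


# Hypothetical carousel markup; a real page will look different.
carousel_html = """
<ul class="carousel">
  <li><img src="https://example.com/thumb/1.jpg" alt="Front"></li>
  <li><img src="https://example.com/thumb/2.jpg" alt="Side"></li>
  <li><img src="https://example.com/thumb/3.jpg" alt="Back"></li>
</ul>
"""

collector = ImgSrcCollector()
collector.feed(carousel_html)

# One data line per image URL, all in the same column.
rows = [{"image_url": url} for url in collector.urls]
for row in rows:
    print(row["image_url"])
```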
Format 3. All image URLs in one data field
Example output:
Option 1. Merge the extracted image URLs into one line
Once you have loop-extracted the image URLs into separate lines (following the steps in Format 2 above), you can merge those lines into a single line.
1) Click the More icon for the data field, then select Merge field data
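Conceptually, merging field data joins the values from several lines into one string. The Python sketch below shows that idea under stated assumptions: the URLs are placeholders, and the "; " separator is chosen only for illustration, since this tutorial does not specify which separator Octoparse uses.

```python
# Hypothetical URLs, one per data line, as produced by a loop extraction.
rows = [
    "https://example.com/thumb/1.jpg",
    "https://example.com/thumb/2.jpg",
    "https://example.com/thumb/3.jpg",
]

# Assumed separator; pick one that cannot appear inside a URL you expect.
SEPARATOR = "; "

# Join all lines into a single data field.
merged = SEPARATOR.join(rows)
print(merged)

# The merge is reversible as long as the separator is unambiguous.
restored = merged.split(SEPARATOR)
```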
Option 2. Scrape the HTML code of the carousel and match out the image URLs from the code
1) Select the entire carousel and select OuterHtml
2) Click the More icon for the field and select Clean data
3) Click Add Step and choose Matching with Regular Expression
4) Inspect the code to find the starting value and ending value of the image URL
5) Click Try the RegEx tool
6) Enter the Start with and End with values to generate a RegEx, then apply the settings
7) Tick Match all and Confirm
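To make the Start with / End with idea concrete, here is a minimal Python sketch of a "match everything between two markers" pattern. It is an illustration only: the expression Octoparse generates may be written differently, and the HTML snippet, the helper name build_between_pattern, and the markers src=" and " are assumptions for this example.

```python
import re


def build_between_pattern(start, end):
    """Match the shortest text that lies between `start` and `end`."""
    return re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)


# Hypothetical OuterHtml of a carousel.
outer_html = (
    '<div class="carousel">'
    '<img src="https://example.com/thumb/1.jpg" alt="Front">'
    '<img src="https://example.com/thumb/2.jpg" alt="Side">'
    "</div>"
)

pattern = build_between_pattern('src="', '"')

# Without "match all", only the first URL is returned.
first_only = pattern.search(outer_html).group(1)

# With "match all", every URL between the markers is returned.
all_urls = pattern.findall(outer_html)

print(first_only)
print(all_urls)
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first starting value to the last ending value and swallow everything in between.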
NOTE: The image URLs scraped here are thumbnail URLs. If you need the full image URLs, you can add further steps to reformat the field. Please check this tutorial: How to scrape the full image URLs instead of thumbnails?