Scrape job information from Indeed

You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend that you upgrade, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

Indeed is one of the most popular job posting websites, and web scraping lets you collect its job listings in bulk for analysis. In this tutorial, we will show you how to use Octoparse to scrape job information, such as the job title, company name, and full description, from Indeed.com.

Before we get started, we need to get the URL of the target result page by searching for a keyword and a location.

Below is an example URL for demonstration:

The easiest way to scrape the website is to go to "Task Templates" on the main screen of Octoparse and start with the ready-to-use Indeed templates. Just enter the URL into the template and wait for the data to come out. For further details, see: Task Templates

If you would like to know how to build the task from scratch, you may continue reading the following tutorial.

The main steps are shown in the menu on the right and you can download the demo task file at the bottom of this tutorial.


1. Create a Go to Web Page - to open the target web page

  • Enter the URL on the home page and click Start


2. Set up Pagination Loop - to scrape data from multiple listing pages

  • Click on the Next page button (>) on the page

  • Choose Loop click on the Tips panel

The Pagination will be created in the workflow.

  • Set the timeout to 10s

To make sure the pagination works reliably, we need to modify its XPath.

  • Click on Pagination

  • Change the XPath to //a[@data-testid="pagination-page-next"]

  • Click Apply to save
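
To illustrate what the modified XPath selects and how a pagination loop ends, here is a minimal Python sketch. It is not part of Octoparse. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs the relative form `.//a[...]` instead of `//a[...]`. The page fragments in SAMPLE_PAGES are made up and heavily simplified (the real Indeed markup differs and may change), and the names find_next_link and walk_pages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing pages keyed by URL path (assumption, not real markup).
SAMPLE_PAGES = {
    "/jobs?start=0": (
        '<div><nav>'
        '<a data-testid="pagination-page-2" href="/jobs?start=10">2</a>'
        '<a data-testid="pagination-page-next" href="/jobs?start=10">Next</a>'
        '</nav></div>'
    ),
    "/jobs?start=10": (
        '<div><nav>'
        '<a data-testid="pagination-page-1" href="/jobs?start=0">1</a>'
        '<a data-testid="pagination-page-next" href="/jobs?start=20">Next</a>'
        '</nav></div>'
    ),
    # The last page has no Next button, so the loop stops here.
    "/jobs?start=20": (
        '<div><nav>'
        '<a data-testid="pagination-page-1" href="/jobs?start=0">1</a>'
        '</nav></div>'
    ),
}

# Same predicate as in the tutorial, written relative to the root element.
NEXT_XPATH = ".//a[@data-testid='pagination-page-next']"


def find_next_link(page_source):
    """Return the href of the Next page button, or None if it is absent."""
    root = ET.fromstring(page_source)
    link = root.find(NEXT_XPATH)
    return None if link is None else link.get("href")


def walk_pages(start, pages):
    """Follow the Next button from `start` until it is missing."""
    visited = []
    current = start
    while current in pages and current not in visited:
        visited.append(current)
        current = find_next_link(pages[current])
    return visited


print(walk_pages("/jobs?start=0", SAMPLE_PAGES))
```

Matching on the data-testid attribute keeps the loop tied to the Next button only, so numbered page links are never mistaken for it.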

Note: If any pop-ups appear on the page, turn on Browse mode in the upper right corner and close the pop-up window manually. After that, turn off Browse mode to continue building the workflow.


3. Create Loop Item - to scrape the job posts list

  • Select the first two job titles

  • Choose Link

A Loop Item will be created in the workflow.

  • Change the XPath of the Loop Item to //*[@id="mosaic-provider-jobcards"]/ul/li

  • Click on the job title on the first job card

  • Choose Text+Link to extract the title and URL of each individual job post

  • Delete any data fields you do not need
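
As a rough illustration of what the Loop Item XPath and the Text+Link extraction do, the following Python sketch selects every list item under the job cards container and reads the title and URL from each one. It is only a sketch under assumptions: SAMPLE_PAGE is a made-up, simplified fragment, extract_titles_and_links is a hypothetical helper, and the standard library's xml.etree.ElementTree needs the relative form `.//*[...]` instead of `//*[...]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified result page (assumption); real Indeed markup is more complex.
SAMPLE_PAGE = """
<html><body>
  <div id="mosaic-provider-jobcards">
    <ul>
      <li><h2 class="jobTitle"><a href="/viewjob?jk=aaa"><span>Data Analyst</span></a></h2></li>
      <li><h2 class="jobTitle"><a href="/viewjob?jk=bbb"><span>Data Engineer</span></a></h2></li>
      <li></li>
    </ul>
  </div>
  <ul><li><a href="/about">Footer link, not a job card</a></li></ul>
</body></html>
"""

# Same path as in the tutorial, written relative to the root element.
LOOP_XPATH = ".//*[@id='mosaic-provider-jobcards']/ul/li"


def extract_titles_and_links(page_source):
    """Return one {title, url} row per job card that contains a link."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for card in root.findall(LOOP_XPATH):
        link = card.find(".//a")
        if link is None:
            continue  # skip list items without a job link, such as empty spacers
        title = " ".join("".join(link.itertext()).split())
        rows.append({"title": title, "url": link.get("href")})
    return rows


for row in extract_titles_and_links(SAMPLE_PAGE):
    print(row)
```

Because the path starts at the element with the id mosaic-provider-jobcards, list items elsewhere on the page (the footer list in this sample) are not picked up.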


4. Add Custom Data Fields - to fetch basic job info on job cards

  • Click on the button Add custom field > Capture data on the page

  • Tick Relative XPath to the Loop Item

  • Enter the XPath as //div[@class="companyLocation"] to extract the company location

  • Repeat the steps above to extract data such as job title, company name, salary, etc.

To make it more convenient for you to set up the task, we've prepared some XPaths for the data fields on the job card. You can copy and paste them into Octoparse to save time.

  • Job title: //h2[contains(@class,'jobTitle')]/a/span

  • Company name: //span[@class='companyName']

  • Salary: //div[contains(@class,'salary-snippet-container')]
    For this field, you also need to set up an alternative XPath to capture all of the salary info: //div[@class='metadata estimated-salary-container']
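
The idea of XPaths relative to the Loop Item, together with the alternative salary XPath, can be sketched in Python as follows. This is an illustration under assumptions, not how Octoparse works internally: SAMPLE_CARDS is made-up markup that only reuses the class names from the XPaths above, and text_of, first_class_contains, and extract_card are hypothetical helpers. The standard library's xml.etree.ElementTree has no contains() function, so that predicate is emulated in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical job cards (assumption); only the class names come from the tutorial.
SAMPLE_CARDS = """
<ul>
  <li>
    <h2 class="jobTitle css-1"><a><span>Data Analyst</span></a></h2>
    <span class="companyName">Acme Corp</span>
    <div class="companyLocation">Austin, TX</div>
    <div class="salary-snippet-container">$70,000 a year</div>
  </li>
  <li>
    <h2 class="jobTitle css-1"><a><span>Data Engineer</span></a></h2>
    <span class="companyName">Globex</span>
    <div class="companyLocation">Remote</div>
    <div class="metadata estimated-salary-container">Estimated $90K - $110K a year</div>
  </li>
</ul>
"""


def text_of(element):
    """Return the whitespace-normalized text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def first_class_contains(card, tag, fragment):
    """Emulate //tag[contains(@class, fragment)] relative to one card."""
    for element in card.iter(tag):
        if fragment in element.get("class", ""):
            return element
    return None


def extract_card(card):
    """Read each field relative to the card, so values never leak between cards."""
    title_h2 = first_class_contains(card, "h2", "jobTitle")
    title = None if title_h2 is None else title_h2.find("./a/span")
    salary = first_class_contains(card, "div", "salary-snippet-container")
    if salary is None:
        # Alternative XPath: exact match on the full class attribute.
        salary = card.find(".//div[@class='metadata estimated-salary-container']")
    return {
        "title": text_of(title),
        "company": text_of(card.find(".//span[@class='companyName']")),
        "location": text_of(card.find(".//div[@class='companyLocation']")),
        "salary": text_of(salary),
    }


cards = ET.fromstring(SAMPLE_CARDS.strip()).findall("./li")
rows = [extract_card(card) for card in cards]
for row in rows:
    print(row)
```

Trying the primary salary XPath first and falling back to the alternative only when it finds nothing mirrors the purpose of the alternative XPath: a card yields a salary value whichever of the two containers it uses.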


5. Click on the Job Title - to open the job details board and fetch more data

  • Click on the job title on the first job card

  • Choose Click URL on the Tips panel

  • Change the XPath of Click Item to //a[contains(@aria-label,'full details')]

  • Uncheck Open in a new Tab for Click Item

  • Set the AJAX timeout to 10s

  • Click on the Job Type and the whole section of the job description

  • Click Element data in the Tips panel

  • Switch to Vertical View

  • Change the XPath of the job description to: //div[@id="jobDescriptionText"]

  • Change the XPath of the job type to: //div[text()='Job Type']/following-sibling::div

  • If Extract data in the Loop has been automatically ticked, please uncheck it

  • Set the Wait before action time to 3s
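
The job type XPath relies on the following-sibling axis: it finds the div whose text is exactly "Job Type" and then takes the div that comes after it under the same parent. The Python sketch below illustrates that logic on a made-up detail panel. SAMPLE_DETAIL, text_of, following_sibling_text, and extract_detail are all hypothetical, and because the standard library's xml.etree.ElementTree supports neither text() nor following-sibling, the axis is emulated by walking each parent's children.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified job details panel (assumption, not real Indeed markup).
SAMPLE_DETAIL = """
<div>
  <div>
    <div>Job Type</div>
    <div>Full-time</div>
  </div>
  <div id="jobDescriptionText"><p>Build dashboards.</p><p>Write SQL.</p></div>
</div>
"""


def text_of(element):
    """Join all text fragments inside an element with single spaces."""
    if element is None:
        return ""
    return " ".join(part.strip() for part in element.itertext() if part.strip())


def following_sibling_text(root, tag, label):
    """Emulate //tag[text()='label']/following-sibling::tag and return the first hit's text."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == tag and child.text == label:
                for sibling in children[index + 1:]:
                    if sibling.tag == tag:
                        return text_of(sibling)
    return ""


def extract_detail(page_source):
    """Return the job type and full description from one detail panel."""
    root = ET.fromstring(page_source.strip())
    return {
        "job_type": following_sibling_text(root, "div", "Job Type"),
        "description": text_of(root.find(".//div[@id='jobDescriptionText']")),
    }


print(extract_detail(SAMPLE_DETAIL))
```

Anchoring on the visible label rather than on a class name is a design choice that tends to survive styling changes, as long as the label text itself stays the same.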


6. Run the task - to get your desired data

  • Click Save

  • Click Run on the upper left side

  • Select Run on your device to run the task on your computer, or select Run task in the Cloud to run the task in the Cloud (for premium users only)

Here is the sample data for your reference:
