XPath plays an important role when you use Octoparse to scrape data. Rewriting XPath can help you deal with missing pages, missing data or duplicates, etc. While XPath may look intimidating at first, it need not be. In this article, I will briefly introduce XPath and more importantly, show you how it can be used to fetch the data you need by building tasks that are accurate and precise.
1. What is XPath
XPath (XML Path Language) is a query language for selecting elements from an XML/HTML document. It can help you find an element from the whole document precisely and quickly.
Web pages are generally written in a language called HTML. If you load a web page in a browser (Chrome, Firefox, etc.), you can open the developer tools with the F12 key and view the corresponding HTML doc. Everything you see on the web page can be found within the HTML, such as images, blocks of text, links and menus.
XPath is one of the most commonly used languages for locating an element in an HTML doc. It can be understood as the "path" that leads to the target element within the HTML doc.
To further explain how XPath works, let's look at an example.
The image shows part of an HTML doc.
HTML has different levels of elements, just like a tree structure. In this example, Level 1 is bookstore and level 2 is book. Title, author, year, price are all level 3.
Text in angle brackets, such as <bookstore>, is called a tag. An HTML element usually consists of a start tag and an end tag, with the content inserted in between.
<tagname>Content goes here...</tagname>
XPath uses "/" to connect tags of different levels from top to bottom in order to specify the location of an element. For our example, if we want to locate the element "author", the XPath would be:
/bookstore/book/author
If you are having trouble understanding how it works, think about how we go about finding a particular file on our computer.
To find the file named "author", the exact file path is \bookstore\book\author. Look familiar?
Every file on the computer has its own path, and so does every element on a web page. With XPath, you can find page elements quickly and easily, just like finding a file on your computer.
The XPath that starts from the root element (the top element in the doc) and goes through all the elements in between to the target element is called an Absolute XPath.
An Absolute XPath can be long and confusing. To simplify it, we can use "//" to reference the element we want the XPath to start with (the result is also known as a short XPath). For example, the short XPath for /bookstore/book/author can be written as //book/author. This short XPath looks for the element book regardless of its absolute location in the HTML, then goes one level down to find the target element author.
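To see the two styles side by side, here is a minimal sketch using Python's built-in xml.etree.ElementTree module. The book values are made up for illustration. Note that ElementTree implements only a subset of XPath and evaluates paths relative to an element, so the paths carry a leading "." here; this is an approximation of the idea, not how Octoparse evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the values are placeholders for illustration.
DOC = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(DOC)  # root is the <bookstore> element

# Starting at <bookstore>, "./book/author" walks the same levels as the
# absolute XPath /bookstore/book/author.
absolute_style = [el.text for el in root.findall("./book/author")]

# ".//book/author" finds <book> at any depth below the start element,
# similar in spirit to the short XPath //book/author.
short_style = [el.text for el in root.findall(".//book/author")]

print(absolute_style)
print(short_style)
```

Both lookups return the same author element; the short form simply does not care how deep <book> sits in the document.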
2. Why do you need to know XPath when using Octoparse
Scraping web pages with Octoparse really means scraping elements from HTML docs, and XPath is what locates the target elements in the doc. Let's take the pagination action as an example.
After we select the next button to build the pagination action, Octoparse generates an XPath to locate the next button, so that it knows which button to click.
XPath helps the crawler to click the right button or to scrape the target data. Any action you want Octoparse to do is based on the underlying XPath. Octoparse can generate XPaths automatically but the auto-generated ones do not always work well. That's why we need to learn to rewrite XPath.
When dealing with issues like missing data, endless loops, incorrect data, duplicate data or a next button that does not get clicked, there's a good chance you can fix them easily by rewriting the XPath.
3. How to write an XPath (cheat sheet included)
Before we start writing an XPath, let's first cover some key terms.
Here's a sample HTML we'll use to demonstrate.
An attribute provides additional information about an element and is always specified in the start tag of the element. Attributes usually come in name/value pairs like: name="value". Some of the most common attributes are href, title, style, src, id and class. You can find the complete HTML attribute reference here.
In our example, id="book" is an attribute of the <div> element and class="book_name" is an attribute of the <span> element.
When one or more HTML elements are contained within an element, the element that contains the other elements is called the parent, and the contained element is a child of the parent. Each element has only one parent but it may have zero, one or more children. Children are found between the start tag and the end tag of the parent.
In our example, the <body> element is the parent of the <h1> and <div> elements. The <h1> and <div> elements are children of the <body> element.
The <div> element is the parent of the two <span> elements. The <span> elements are the children of the <div> element.
Elements that have the same parent are called siblings. The <h1> and <div> elements are siblings as they have the same parent <body>.
The two <span> elements, both nested under the <div> element, are also siblings.
Let's look at some common use cases!
Write an XPath to locate the Next Page button
So first we'll have to inspect the Next Page button in the HTML closely. In the sample HTML below, there are two things that stand out. First, there is a title attribute with the value "Next" and second, the content "Next".
In this case, we can use either the title attribute or the content text to target the Next Page button in the HTML.
The XPath that locates the <a> element that has a title attribute with the value "Next" would look like this: //a[@title="Next"]
This XPath basically says, go to the <a> element(s) whose title attribute is "Next". The @ symbol is used in the XPath to target an attribute.
Alternatively, the XPath that locates the <a> element that has "Next" included in the content looks like this: //a[contains(text(), "Next")]
This XPath says, go to the <a> element(s) whose content contains the text "Next".
You can also use both the title attribute and the content text to write the XPath.
//a[@title="Next" and contains(text(), "Next")]
This XPath says, go to the <a> element(s) that have a title attribute with the value "Next" and whose content includes the text "Next".
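If you'd like to try these three conditions outside Octoparse, here is a rough sketch in Python. The pagination markup is hypothetical (it is not the sample from this article), and because the standard library's ElementTree has no contains() function, the text test is emulated with a plain substring check.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, made up for illustration.
HTML = """
<div class="pagination">
  <a href="/page/1" title="Previous">Previous</a>
  <a href="/page/3" title="Next">Next</a>
</div>
"""

root = ET.fromstring(HTML)

# //a[@title="Next"]: attribute predicates are supported directly.
by_title = root.findall(".//a[@title='Next']")

# //a[contains(text(), "Next")]: emulated with a substring test on the text.
by_text = [a for a in root.iter("a") if "Next" in (a.text or "")]

# //a[@title="Next" and contains(text(), "Next")]: both conditions must hold.
by_both = [a for a in by_title if "Next" in (a.text or "")]

print([a.get("href") for a in by_both])
```

All three approaches land on the same link here. On a real page, combining conditions helps when one of them alone would also match other elements.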
Write an XPath to locate loop item
To target a list of items on a web page, it is important to look for the pattern among the list items. Items of the same list generally share the same or similar attributes. In the sample HTML below, we see that all <li> elements have similar class attributes.
Based on this observation, we can use contains(@attribute, "value") to target all items of the list:
//li[contains(@class, "product_item")]
This XPath says, go to the <li> element(s) whose class attribute contains "product_item".
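Here is a small sketch of what //li[contains(@class, "product_item")] matches. The list markup and class names are assumptions made up for the example, and since Python's built-in ElementTree does not implement contains(), the condition is emulated with a substring test on the class attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical product list; class names are invented for illustration.
HTML = """
<ul class="products">
  <li class="product_item first">Laptop</li>
  <li class="product_item">Phone</li>
  <li class="product_item last">Tablet</li>
  <li class="ad_banner">Sponsored</li>
</ul>
"""

root = ET.fromstring(HTML)

# Emulates //li[contains(@class, "product_item")]: keep every <li> whose
# class attribute contains the substring, whatever else the class holds.
loop_items = [li for li in root.iter("li") if "product_item" in li.get("class", "")]

names = [li.text for li in loop_items]
print(names)
```

An exact match such as @class="product_item" would only pick up the second item, which is why contains() is the safer choice when the class values vary slightly.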
Write an XPath to locate data fields
Locating a particular data field is very similar to locating the Next Page button using text() or attribute.
Say we want to write an XPath that locates the address in the sample HTML above. We can use the itemprop attribute that has the value "address" to target the particular element:
//div[@itemprop="address"]
This XPath says, go to the <div> element that has an itemprop attribute with the value "address".
There's another way to approach this. Notice how the <div> element containing the actual address is always found right after its sibling <div> element, the one that has the content "Location:". So we can first locate the "Location" text and then select the first sibling that follows:
//div[contains(text(), "Location")]/following-sibling::div[1]
This XPath says, go to the <div> element that has "Location" included in the content, then go to its first following sibling <div> element.
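Both approaches can be sketched in Python as follows. The markup and the address are hypothetical, and the helper first_following_div is a made-up name: ElementTree has no following-sibling axis, so the helper emulates it by walking the children of each parent. It returns only the first match, which is a simplification of what the XPath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the address is a placeholder.
HTML = """
<div class="store">
  <div>Location:</div>
  <div itemprop="address">123 Main Street</div>
</div>
"""

root = ET.fromstring(HTML)

# //div[@itemprop="address"]: locate the field by its attribute.
by_attribute = root.find(".//div[@itemprop='address']")


def first_following_div(tree_root, label):
    """Emulate //div[contains(text(), label)]/following-sibling::div[1]."""
    for parent in tree_root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "div" and label in (child.text or ""):
                # Return the first <div> that comes after the label element.
                for sibling in children[index + 1:]:
                    if sibling.tag == "div":
                        return sibling
    return None


by_sibling = first_following_div(root, "Location")
print(by_attribute.text, by_sibling.text)
```

The sibling approach is handy when the target element has no distinctive attribute of its own but always sits next to a stable label.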
By now, you may have noticed that there is more than one way to target an element in the HTML, just like there is always more than one path to any destination. The key is to make use of the tag, attributes, content text, siblings, parent, or whatever else helps you locate the target element in the HTML.
To make things easier for you, here is a cheat sheet of helpful XPath expressions to help you quickly target any elements in the HTML.
Matches any elements
//div/*
Selects all the child elements of the <div> element
Selects elements by attribute value
//div[@id="book"]
Selects all the <div> elements that have an "id" attribute with a value of "book"
Finds elements with exact text
//span[text()="Harry Potter"]
Selects all the <span> elements whose content is exactly "Harry Potter"
Selects elements that contain a certain string
//span[contains(@class, "price")]
Selects all the <span> elements whose class attribute value contains "price"
//span[contains(text(), "Learning")]
Selects all the <span> elements whose content contains "Learning"
Selects elements in a certain position
//div/span[2]
Selects the second <span> element that is a child of the <div> element
//div/span[position()<3]
Selects the first 2 <span> elements that are children of the <div> element
Selects the last element
//div/span[last()]
Selects the last <span> element that is a child of the <div> element
//div/span[last()-1]
Selects the last but one <span> element that is a child of the <div> element
//div/span[position()>last()-3]
Selects the last 3 <span> elements that are children of the <div> element
Selects elements that are opposite to the conditions specified
//span[not(contains(@class, "price"))]
Selects all the <span> elements whose class attribute value does not contain "price"
//span[not(contains(text(), "Learning"))]
Selects all the <span> elements whose text does not contain "Learning"
Selects elements that match several conditions
//span[@class="book_name" and text()="Harry Potter"]
Selects all the <span> elements whose class attribute value is "book_name" and whose text is "Harry Potter"
Selects elements that match one of the conditions
//span[@class="book_name" or text()="Harry Potter"]
Selects all the <span> elements whose class attribute value is "book_name" or whose text is "Harry Potter"
Selects all siblings after the current element
//span[text()="Harry Potter"]/following-sibling::span[1]
Selects the first <span> element after the <span> element whose text is "Harry Potter"
Selects all siblings before the current element
//span[@class="regular_price"]/preceding-sibling::span[1]
Selects the first <span> element before the <span> element whose class attribute value is "regular_price"
Selects the parent of the current element
//div[@id="bookstore"]/..
Selects the parent of the <div> element whose id attribute value is "bookstore"
Selects several paths
//div[@id="bookstore"] | //span[@class="regular_price"]
Selects all the <div> elements whose id attribute value is "bookstore" and all the <span> elements whose class attribute value is "regular_price".
*Note that the attribute and text value are all case-sensitive.
*For a more exhaustive list of XPath expressions, check this out.
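A few of the cheat sheet patterns can be tried directly with Python's built-in ElementTree, which supports wildcards, attribute predicates, positions, last() and the parent step (but not contains(), not(), and/or or the sibling axes). The markup below is a hypothetical sample, and the texts helper is a made-up convenience function.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, loosely following the terms used in this article.
HTML = """
<body>
  <h1>Books</h1>
  <div id="book">
    <span class="book_name">Harry Potter</span>
    <span class="regular_price">29.99</span>
    <span class="sale_price">19.99</span>
  </div>
</body>
"""

root = ET.fromstring(HTML)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


children = texts(".//div/*")                     # //div/*
by_id = root.findall(".//div[@id='book']")       # //div[@id="book"]
exact_text = texts(".//span[.='Harry Potter']")  # //span[text()="Harry Potter"]
second_span = texts(".//div/span[2]")            # //div/span[2]
last_span = texts(".//div/span[last()]")         # //div/span[last()]
last_but_one = texts(".//div/span[last()-1]")    # //div/span[last()-1]

# //span[@class="regular_price"]/.. : step up to the parent element.
parent = root.find(".//span[@class='regular_price']/..")

print(children, second_span, last_span, last_but_one, parent.get("id"))
```

The [.='text'] predicate needs Python 3.7 or later. For the expressions ElementTree cannot evaluate, a browser XPath tool is the easier way to test them.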
4. Absolute XPath and Relative XPath (for loops)
So far we've covered how to write an XPath when you need to extract an element from a web page directly. There are times, however, when you may need to first build a list of target items and then extract the data from each item, for example when you need to extract data from results pages like this (https://www.bestbuy.com/site/promo/tv-deals).
In this case, you are not only required to know the Absolute XPath (which you'd use for capturing elements directly), but also the Relative XPath to the Loop Item, one that specifies the location of the specific list item relative to the list.
In Octoparse, when you modify the XPath of a data field, you will see there are two XPath boxes.
Absolute XPath is used when we extract data from the web page directly.
NOTE: The Absolute XPath in Octoparse is different from the Absolute XPath described in section 1. In Octoparse, Absolute XPath means the element is located from the whole web page rather than from within a loop item. It can therefore be concise, like "//h1[@class="..."]/span...", rather than a complicated full path like "/html/body/div/div/div/div/div/div/span/span…".
TIP: You can also check the XPath type and the element XPath easily by switching the Data Preview to Vertical View mode.
Relative XPath is used when we extract data from a loop item. Specifically, when we build a workflow like this:
The Relative XPath in Octoparse is the part of the element XPath that comes after the Loop Item XPath. In other words, it describes where the element sits relative to the loop item.
For example, if we want to create a loop list of <li> elements and scrape an element contained within the individual <li> elements in the list, we can use the XPath //ul[@class="results"]/li to locate the list.
Suppose the XPath of an element in the list is //ul[@class="results"]/li/div/a[@class="link"]. In this case, the Relative XPath should be /div/a[@class="link"]. We can also simplify this Relative XPath using "//" to //a[@class="link"]. Using "//" is generally recommended when writing a Relative XPath, as it makes the expression more concise.
Let's make it easier to see the relationship between the different XPaths.
Loop Item XPath: //ul[@class="results"]/li
XPath of the element you want to locate in Loop Item: //ul[@class="results"]/li/div/a[@class="link"]
Relative XPath to the Loop Item: /div/a[@class="link"]
We should then enter the Loop Item XPath and the Relative XPath like this in Octoparse:
You may have noticed that when the Loop Item XPath and the Relative XPath are combined into one XPath, you get exactly the XPath of the element.
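The relationship between the Loop Item XPath and the Relative XPath can be sketched in Python like this. The results markup is hypothetical, and the leading "." is an ElementTree requirement for anchoring a path at the current element; it is not something you type into Octoparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical results page, made up for illustration.
HTML = """
<body>
  <ul class="results">
    <li><div><a class="link" href="/item/1">First item</a></div></li>
    <li><div><a class="link" href="/item/2">Second item</a></div></li>
  </ul>
</body>
"""

root = ET.fromstring(HTML)

LOOP_ITEM_XPATH = "//ul[@class='results']/li"
RELATIVE_XPATH = "/div/a[@class='link']"

# Step 1: build the list of loop items from the whole document.
loop_items = root.findall("." + LOOP_ITEM_XPATH)

# Step 2: inside each loop item, apply the relative path to get the field.
rows = []
for item in loop_items:
    link = item.find("." + RELATIVE_XPATH)
    rows.append((link.text, link.get("href")))

# Joining the two parts gives the full element XPath, which matches the
# same links when evaluated against the whole document.
combined = root.findall("." + LOOP_ITEM_XPATH + RELATIVE_XPATH)

print(rows)
```

Evaluating the field per loop item keeps each link paired with its own list item, which is what makes the extracted rows line up.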
5. Four simple steps to fix your XPath
STEP 1: Open the web page using a browser with an XPath tool (one that allows you to view the HTML and look up an XPath query). XPath Helper (a Chrome extension) is recommended if you use Chrome.
STEP 2: Once you have the web page loaded, inspect the target element in the HTML.
STEP 3: Inspect the HTML element closely, as well as the elements nearby. Do you see anything that stands out and may help you identify and locate the target element? Perhaps a class attribute like class="sku-title" or class="sku-header"?
Use the cheat sheet above to write an XPath that selects the element exclusively and precisely. Your XPath should match only the target element(s) and nothing else in the entire HTML doc. Using XPath Helper, you can always test whether the rewritten XPath works as expected.
STEP 4: Replace the auto-generated XPath in Octoparse.
6. XPath troubleshooting tutorials
In most cases, you don’t need to write the XPath on your own. But there are some situations where you might need to do some modifications to scrape more accurately.
7. XPath tool
It is not easy to check the HTML code directly in Octoparse, so we need to use some other tools to help us generate an XPath.
You can get an XPath for an element easily with any browser. Let’s take Chrome as an example.
Open the web page in Chrome
Right-click on the item you want to find the XPath of
Choose "Inspect" to open Chrome DevTools
Right-click on the highlighted element in the Elements panel
Go to Copy -> Copy XPath
The copied XPath is sometimes an Absolute XPath, typically when the element has no attribute or the attribute value is too long. In that case, you may still need to write the XPath yourself.
XPath Helper (download here)
XPath Helper is a Chrome extension that lets you look up an XPath by simply hovering over the element in the browser. You can also edit the XPath query directly in its console. The result(s) appear immediately, so you know whether your XPath is working correctly.