You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust.
Octoparse locates data with XPath, but data can change position within a web page. To handle this, we will show you how to extract data more accurately by associating it with nearby text.
First, let’s look at an example of when this technique can be useful.
In the example image above, the value for "Brand" is located next to the word "Brand". Similarly, the value for "Item Weight" will always be found next to the words "Item Weight". The same pattern should apply to the rest of the list.
While "Item Weight" might move from the third row to the fourth row of the list, its associated value should always be found next to it. Therefore, a more consistent way to capture the value of any field is to first find the label text, then locate the data next to it. In this example, instead of trying to find the value "10 pounds" directly on the page, we can capture it more accurately by relating it to the text "Item Weight".
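To make the difference concrete, here is a minimal Python sketch using only the standard library. It is not how Octoparse works internally. The two tables, the values in them, and the helper names value_by_position and value_by_label are all hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tables with the same data; "Item Weight" sits in a different row in each.
PAGE_A = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Item Weight</th><td>10 pounds</td></tr>
  <tr><th>Color</th><td>Black</td></tr>
</table>
"""
PAGE_B = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Color</th><td>Black</td></tr>
  <tr><th>Item Weight</th><td>10 pounds</td></tr>
</table>
"""


def value_by_position(markup, row_index):
    """Return the <td> text of the row at a fixed index (fragile)."""
    rows = ET.fromstring(markup.strip()).findall("tr")
    return rows[row_index].find("td").text


def value_by_label(markup, label):
    """Return the <td> text of the row whose <th> contains the label."""
    for row in ET.fromstring(markup.strip()).iter("tr"):
        th, td = row.find("th"), row.find("td")
        if th is not None and td is not None and label in (th.text or ""):
            return td.text
    return None  # label not found on this page
```

With these sample tables, value_by_position(PAGE_B, 1) returns "Black" because the row order changed, while value_by_label(PAGE_B, "Item Weight") still returns "10 pounds".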
Follow the steps below to see how it is done:
STEP 1. Click on 10 pounds on the page to extract the text for Item Weight.
STEP 2. Go to the data preview panel and click Customize XPath.
STEP 3. Find the XPath relating to the text of the target data field.
Now, open the page in the Chrome browser and right-click the target data to inspect it.
Notice that the words "Item Weight" are found within the <th> tag, while the associated value is found within the <td> tag that immediately follows it in the HTML.
Once we see the pattern, we can write an XPath that looks for the value of "Item Weight" relative to where the words appear: "//th[contains(text(),'Item Weight')]/following-sibling::td". This XPath expression tells the program to look for the <th> tag containing the text "Item Weight", then select the <td> tag that follows it under the same parent element. Strictly speaking, following-sibling::td matches every following <td> sibling, so if a row contained several <td> cells, you could append [1] (following-sibling::td[1]) to keep only the nearest one. Here, the expression gives us exactly what we want: the associated value of "Item Weight".
Enter the new XPath in the Matching XPath text box and click Apply to save the settings.
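The idea behind the expression above can be tried outside Octoparse with a short Python sketch. One caveat: Python's built-in xml.etree.ElementTree implements only a subset of XPath and supports neither the following-sibling axis nor contains(). The sketch therefore uses a row-based path that reaches the same cell in a simple table, ".//tr[th='Item Weight']/td", which requires an exact match on the <th> text. The sample table and its values are assumptions. To evaluate the original expression unchanged, a full XPath 1.0 engine such as lxml would be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical spec table; the real page structure may differ.
SPEC_TABLE = """<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Item Weight</th><td>10 pounds</td></tr>
</table>"""

root = ET.fromstring(SPEC_TABLE)

# Select the row whose <th> text is exactly "Item Weight",
# then take the <td> children of that row.
cells = root.findall(".//tr[th='Item Weight']/td")
item_weight = [cell.text for cell in cells]
```

Because the row is chosen by its label and not by its position, adding or reordering rows in the sample table does not change which cell is returned.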
Now, Octoparse will always look for the associated value of "Item Weight" according to where the words "Item Weight" are shown on the web page. Applying this technique to similar fields on the list can help reduce the chance of scraping the wrong elements.
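When several fields follow the same label/value layout, the matching XPath for each one differs only in the label text. The following sketch generates those expressions so they can be pasted into the Matching XPath box one by one. The helper name label_xpath and the list of labels are hypothetical, and the sketch assumes every label sits in a <th> with its value in a following <td>, as in the example above.

```python
def label_xpath(label: str) -> str:
    """Build the label-anchored XPath used in this tutorial for a given label."""
    if "'" in label and '"' in label:
        # XPath 1.0 string literals cannot escape quotes; keep the sketch simple.
        raise ValueError("label mixes single and double quotes")
    quote = '"' if "'" in label else "'"
    return f"//th[contains(text(),{quote}{label}{quote})]/following-sibling::td"


# Hypothetical field labels from a product detail table.
FIELDS = ["Brand", "Item Weight"]
XPATHS = {field: label_xpath(field) for field in FIELDS}
```

Each generated expression should still be checked against the actual page, since other fields may use different tags.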
Tip: The following-sibling axis is often used to find an element located next to another designated element. Learn more about XPath here!