I’m working through the initial configuration of my crawler connector, which extracts product data from specific product pages. Each page has JSON-LD (application/ld+json) markup similar to the example below. If possible, I’d like to extract values from that markup, since it is more structured and less likely to change than the HTML on the pages. I’m able to target the script element, but extracting the value for a given key is difficult. Is there a better way to return a value from a JSON object on a page, given a specific key?
Note that I am working to avoid using the crawler entirely, but it is the short-term solution for the moment.
<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Wrist Brace",
"image": "https://imgcdn.companyname.com/CumulusWeb/Images/High_Res/1159155_ppkgleft.jpg" ,
"description": "Wrist Brace Company Low Profile / Contoured / Wraparound Aluminum / Cotton / Elastic Right Hand Beige Medium",
"manufacturer": "Company Brand",
"mpn": "000-79-87075",
"model": "000-79-87075",
"brand": {
"@type": "Thing",
"name": "Company Name"
}
}
</script>
I can pretty easily pull in the full JSON object with an XPath selector using the following criteria:
//script[@type="application/ld+json"]
This is an example of what the output of that selector looks like:
"{ ""@context"": ""http://schema.org/"", ""@type"": ""Product"", ""name"": ""Walker Boot"", ""image"": ""https://imgcdn.company.com/CumulusWeb/Images/High_Res/1159113_left.jpg"" , ""description"": ""Walker Boot company Medium Hook and Loop Closure Male 7-1/2 to 10-1/2 / Female 8-1/2 to 11-1/2 Left or Right Foot"", ""manufacturer"": ""company Brand"", ""mpn"": ""000-79-95505"", ""model"": ""000-79-95505"", ""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"
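The doubled quotes in that output look like CSV-style escaping of the original JSON. Assuming that is the case, and assuming the value can be handed to a script outside the connector (I have not confirmed the connector allows this), a minimal Python sketch of recovering a value by key could look like the following. The function name `value_from_escaped_json` is made up for illustration, and the sample is a shortened version of the output above.

```python
import json

def value_from_escaped_json(raw, key):
    """Return the top-level value for `key` from a CSV-style escaped JSON string."""
    text = raw.strip()
    # Drop the outer quotes that wrap the whole field, if present.
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    # Collapse doubled quotes back to single quotes (assumed CSV escaping).
    text = text.replace('""', '"')
    return json.loads(text).get(key)

# Shortened version of the selector output shown above.
sample = '"{ ""@type"": ""Product"", ""name"": ""Walker Boot"", ""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"'

print(value_from_escaped_json(sample, "name"))  # Walker Boot
```

Because the string is parsed as JSON rather than searched as text, the nested brand name stays separate from the product name and is reachable as `value_from_escaped_json(sample, "brand")["name"]`.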
Unfortunately, I don’t know whether I can target individual key/value pairs within the JSON object using either CSS or XPath criteria.
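For comparison, this is roughly what I would do outside the connector. It is only a sketch using Python's standard library, assuming the raw page HTML is available to a script; the class name `LdJsonCollector` is my own, and the sample HTML is a trimmed copy of the markup above.

```python
import json
from html.parser import HTMLParser

class LdJsonCollector(HTMLParser):
    """Collect and parse every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.blocks = []  # one parsed JSON object per matching script element

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it.
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_ld_json = False

html = """
<script type="application/ld+json">
{
"@context": "http://schema.org/",
"@type": "Product",
"name": "Wrist Brace",
"mpn": "000-79-87075",
"brand": {"@type": "Thing", "name": "Company Name"}
}
</script>
"""

parser = LdJsonCollector()
parser.feed(html)
product = parser.blocks[0]
print(product["name"])           # Wrist Brace
print(product["brand"]["name"])  # Company Name
```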
I can get the value for a given key by chaining two “extract” transforms. It’s a little ugly, which is why I’m asking here whether there is a better way.
Transform 1:
The first transform gets all of the text after a given key. For example, the following plain-text criteria:
""name"": ""
Yields the following:
Walker Boot"", ""image"": ""https://imgcdn.mckesson.com/CumulusWeb/Images/High_Res/1159113_left.jpg"" , ""description"": ""Walker Boot McKesson Medium Hook and Loop Closure Male 7-1/2 to 10-1/2 / Female 8-1/2 to 11-1/2 Left or Right Foot"", ""manufacturer"": ""McKesson Brand"", ""mpn"": ""000-79-95505"", ""model"": ""000-79-95505"", ""brand"": { ""@type"": ""Thing"", ""name"": ""McKesson"" } }"
Transform 2:
If I transform the result again, using the following as the criteria:
""
It yields a cleaned name:
Walker Boot
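To make the two steps concrete, here is a rough Python equivalent. It assumes the first transform keeps the text after the first match and the second keeps the text before the first match; that is my reading of the results above, not documented behavior. The helper names `text_after` and `text_before` are made up, and the sample is a shortened version of the selector output.

```python
def text_after(text, marker):
    """Keep everything after the first occurrence of `marker` ('' if absent)."""
    _, found, rest = text.partition(marker)
    return rest if found else ""

def text_before(text, marker):
    """Keep everything before the first occurrence of `marker`."""
    head, _, _ = text.partition(marker)
    return head

# Shortened version of the selector output, with its doubled quotes.
raw = '"{ ""@type"": ""Product"", ""name"": ""Walker Boot"", ""mpn"": ""000-79-95505"", ""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"'

step1 = text_after(raw, '""name"": ""')  # Transform 1
step2 = text_before(step1, '""')         # Transform 2
print(step2)  # Walker Boot
```

This only returns the product name because the top-level "name" key appears before the nested brand "name". If the key order changed, the same two steps would return the brand name instead, which is part of why I'd prefer a key-based lookup.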