KnowledgeGraph - Crawler Connector - Extract from LD+JSON Element

Zach_Shearer · March 10, 2022, 3:42pm

I’m working through the initial configuration of my crawler connector, extracting product data from specific product pages. Each page has LD+JSON markup similar to the example below. If possible, I’d like to extract values from this markup, since it is more structured and less likely to change than the HTML on the pages. I’m able to target the script element, but extracting the value for a specific key is difficult. Is there a better way to return a value from a JSON object on a page, given a specific key?

Note that I am working to avoid using the crawler entirely, but it is the short-term solution for the moment.

<script type="application/ld+json">
{
    "@context": "http://schema.org/",
    "@type": "Product",
    "name": "Wrist Brace",
    "image": "https://imgcdn.companyname.com/CumulusWeb/Images/High_Res/1159155_ppkgleft.jpg",
    "description": "Wrist Brace Company Low Profile / Contoured / Wraparound Aluminum / Cotton / Elastic Right Hand Beige Medium",
    "manufacturer": "Company Brand",
    "mpn": "000-79-87075",
    "model": "000-79-87075",
    "brand": {
        "@type": "Thing",
        "name": "Company Name"
    }
}
</script>
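
For anyone pulling this data outside the crawler, markup like the above can be read with a few lines of code. The following is only a minimal Python sketch using the standard library, not a crawler feature: the `LdJsonParser` class and the `PAGE` string are hypothetical names made up for illustration, and the sample page is a shortened copy of the markup above.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.objects = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self.objects.append(json.loads("".join(self._chunks)))
            self._in_ld_json = False


# Hypothetical sample page: a shortened copy of the markup in the post.
PAGE = """
<html><head>
<script type="application/ld+json">
{
    "@context": "http://schema.org/",
    "@type": "Product",
    "name": "Wrist Brace",
    "mpn": "000-79-87075",
    "brand": {"@type": "Thing", "name": "Company Name"}
}
</script>
</head><body></body></html>
"""

parser = LdJsonParser()
parser.feed(PAGE)
parser.close()

product = parser.objects[0]
print(product["name"])           # top-level key
print(product["brand"]["name"])  # nested key
```

Once the script contents are parsed as JSON, any key, including nested ones such as the brand name, is a plain dictionary lookup rather than a string match.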

I can pull in the full JSON object fairly easily with an XPath selector, using the following criteria:

//script[@type="application/ld+json"]

This is an example of what the output of that selector looks like:

"{ ""@context"": ""http://schema.org/"", ""@type"": ""Product"", ""name"": ""Walker Boot"", ""image"": ""https://imgcdn.company.com/CumulusWeb/Images/High_Res/1159113_left.jpg"" , ""description"": ""Walker Boot company Medium Hook and Loop Closure Male 7-1/2 to 10-1/2 / Female 8-1/2 to 11-1/2 Left or Right Foot"", ""manufacturer"": ""company Brand"", ""mpn"": ""000-79-95505"", ""model"": ""000-79-95505"", ""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"

Unfortunately, I don’t know whether I can target the individual key/value pairs within the JSON object using either CSS or XPath criteria.
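
If the selector output is exported and handled outside the crawler, the doubled quotes in it look like CSV-style escaping. That is an assumption based only on how the sample output appears, not something confirmed for the crawler. Under that assumption, a minimal Python sketch can undo the escaping and read values by key; the `raw` string is a shortened copy of the output shown in this post.

```python
import csv
import json

# Shortened copy of the selector output from the post (some keys omitted).
raw = (
    '"{ ""@type"": ""Product"", ""name"": ""Walker Boot"", '
    '""mpn"": ""000-79-95505"", '
    '""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"'
)

# Assumption: the outer quotes and doubled inner quotes are CSV escaping,
# so a CSV reader recovers the original text as a single field.
fields = next(csv.reader([raw]))
product = json.loads(fields[0])

print(product["name"])
print(product["brand"]["name"])
```

This only helps if the raw value is processed somewhere other than the connector itself.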

The workaround I’ve found is a little ugly, which is why I’m asking here whether there is a better way: I can get the value for a key by chaining two “extract” transforms.

Transform 1:
Extract all of the text after a given key. For example, the following plain text criteria:

""name"": ""

Yields the following:

Walker Boot"", ""image"": ""https://imgcdn.mckesson.com/CumulusWeb/Images/High_Res/1159113_left.jpg"" , ""description"": ""Walker Boot McKesson Medium Hook and Loop Closure Male 7-1/2 to 10-1/2 / Female 8-1/2 to 11-1/2 Left or Right Foot"", ""manufacturer"": ""McKesson Brand"", ""mpn"": ""000-79-95505"", ""model"": ""000-79-95505"", ""brand"": { ""@type"": ""Thing"", ""name"": ""McKesson"" } }"

Transform 2:
Then I transform again, this time using the following criteria:

""

It yields a cleaned name:

Walker Boot
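
To sanity-check the two steps outside the crawler, they can be mimicked in a few lines of Python. This is a rough sketch only: it assumes the first extract keeps the text after the first match and the second keeps the text before the first match, which is what the outputs above suggest. The helper names `text_after` and `text_before` are made up for illustration and are not crawler functions, and the input string is a shortened copy of the selector output.

```python
def text_after(text, marker):
    """Return everything after the first occurrence of marker ('' if absent)."""
    _, found, rest = text.partition(marker)
    return rest if found else ""


def text_before(text, marker):
    """Return everything before the first occurrence of marker."""
    head, _, _ = text.partition(marker)
    return head


# Shortened copy of the selector output from the post (some keys omitted).
selector_output = (
    '"{ ""@context"": ""http://schema.org/"", ""@type"": ""Product"", '
    '""name"": ""Walker Boot"", '
    '""image"": ""https://imgcdn.company.com/CumulusWeb/Images/High_Res/1159113_left.jpg"" , '
    '""mpn"": ""000-79-95505"", '
    '""brand"": { ""@type"": ""Thing"", ""name"": ""company"" } }"'
)

# Transform 1: keep the text after the key.
step1 = text_after(selector_output, '""name"": ""')

# Transform 2: keep the text before the next doubled quote.
name = text_before(step1, '""')

print(name)
```

Because the first match wins, this picks up the top-level name and not the nested brand name. Getting the brand name the same way would need an extra step that first cuts to the text after the brand key.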

Micaela_Luders · March 10, 2022, 11:19pm

Hi Zach,

Unfortunately, there isn’t an easier way to extract values from LD+JSON markup with the crawler. The transforms route you described is currently the best option for getting the result you are looking for. However, you can submit this idea to the Ideas board for our Product Managers to consider for the product roadmap!
