How to get the meta description and title of links from a sitemap using PHP

How do you get the meta description and title of links from a sitemap using PHP? As you might know, a sitemap file by default doesn’t include information such as the post title or the meta description of the links it lists.

If you are trying, say, to create an RSS feed from an XML sitemap file with PHP, this approach of retrieving the meta description and title of each link in the sitemap might come in handy.

What is a sitemap?

A sitemap is a list of pages of a website within a domain. There are three primary kinds of sitemaps: sitemaps used during the planning of a website by its designers; human-visible listings, typically hierarchical, of the pages on a site; and structured listings intended for web crawlers such as search engines. This tutorial deals with the third kind, the XML sitemap.

To get the meta description and title of links from a sitemap in PHP, you can use the SimpleXMLElement class to parse the XML data in the sitemap and retrieve the URLs. Then, for each URL, you can use the file_get_contents() function to retrieve the HTML content of the page, and then use regular expressions or a DOM parser to extract the title and meta description tags.

The example code snippet below uses the website https://example.com, whose sitemap URL is https://example.com/sitemap.xml:

<?php
// URL of the sitemap
$sitemap_url = "https://example.com/sitemap.xml";

// Load the sitemap XML into a SimpleXMLElement object
$sitemap_xml = new SimpleXMLElement(file_get_contents($sitemap_url));

// Loop through each URL in the sitemap
foreach ($sitemap_xml->url as $url) {
  // Get the URL from the XML object
  $url_str = (string)$url->loc;

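  // Retrieve the HTML content of the page; skip pages that cannot be fetched
  $html = @file_get_contents($url_str);
  if ($html === false) {
    continue;
  }

  // Extract the title and meta description using regular expressions.
  // The i and s flags tolerate uppercase tags and values that span lines;
  // ?? '' avoids an undefined-index warning when a tag is missing.
  preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $title_matches);
  $title = trim($title_matches[1] ?? '');

  preg_match('/<meta\s+name="description"\s+content="(.*?)"/is', $html, $desc_matches);
  $description = trim($desc_matches[1] ?? '');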

  // Output the results
  echo "URL: " . $url_str . "\n";
  echo "Title: " . $title . "\n";
  echo "Description: " . $description . "\n\n";
}
?>

This code outputs the URL, title, and description of each page listed in the sitemap. Note that the regular expressions are simple: they expect the meta tag's name attribute to come before content and both values to be wrapped in double quotes. For pages that are less uniform, you could use a DOM parser such as SimpleHTMLDOM instead.
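As a rough illustration of the DOM parser route, here is a minimal sketch that uses PHP's built-in DOMDocument class rather than SimpleHTMLDOM. The function name extract_title_and_description() and the sample HTML are made up for this example and are not part of the original snippet.

```php
<?php
// Hypothetical helper (not from the original snippet): extracts the title and
// meta description from an HTML string with PHP's built-in DOM extension.
function extract_title_and_description(string $html): array
{
    $result = ['title' => '', 'description' => ''];

    // loadHTML() rejects an empty string, so return early in that case
    if (trim($html) === '') {
        return $result;
    }

    // Real-world HTML is rarely valid; collect parse errors internally
    // instead of emitting a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <title> element, if any
    $title_node = $doc->getElementsByTagName('title')->item(0);
    if ($title_node !== null) {
        $result['title'] = trim($title_node->textContent);
    }

    // First <meta> whose name attribute is "description" (case-insensitive),
    // regardless of attribute order or quoting style
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $result['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    return $result;
}

// Usage with an inline sample page (content comes before name on purpose)
$sample = '<html><head><title>Hello &amp; welcome</title>'
        . '<meta content="A sample page." name="description"></head><body></body></html>';
$meta = extract_title_and_description($sample);
echo "Title: " . $meta['title'] . "\n";
echo "Description: " . $meta['description'] . "\n";
```

Inside the sitemap loop you would pass the fetched $html to this function instead of calling preg_match(). One caveat: loadHTML() treats the input as ISO-8859-1 unless the page declares its charset, so UTF-8 pages without a charset declaration may come out with garbled characters.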

Adding the retrieved meta description and title to the RSS feed

To add the title and meta description to an RSS feed item list in PHP, you can use the same code as in the previous example to retrieve the title and description for each URL in the sitemap. Then, for each URL, you create an RSS feed item and add the title, description, and link as child elements in the RSS XML structure.

Here’s an example code snippet that demonstrates how to do this:

<?php
// URL of the sitemap
$sitemap_url = "https://example.com/sitemap.xml";

// Load the sitemap XML into a SimpleXMLElement object
$sitemap_xml = new SimpleXMLElement(file_get_contents($sitemap_url));

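// Create a new RSS feed object. RSS 2.0 requires every channel to have
// a title, link, and description; the values below are placeholders.
$rss = new SimpleXMLElement('<rss version="2.0"></rss>');
$channel = $rss->addChild('channel');
$channel->addChild('title', 'Example feed');
$channel->addChild('link', 'https://example.com/');
$channel->addChild('description', 'Pages listed in the sitemap');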

// Loop through each URL in the sitemap
foreach ($sitemap_xml->url as $url) {
  // Get the URL from the XML object
  $url_str = (string)$url->loc;

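  // Retrieve the HTML content of the page; skip pages that cannot be fetched
  $html = @file_get_contents($url_str);
  if ($html === false) {
    continue;
  }

  // Extract the title and meta description using regular expressions.
  // ?? '' avoids an undefined-index warning when a tag is missing.
  preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $title_matches);
  $title = trim($title_matches[1] ?? '');

  preg_match('/<meta\s+name="description"\s+content="(.*?)"/is', $html, $desc_matches);
  $description = trim($desc_matches[1] ?? '');

  // Create a new RSS feed item object. addChild() does not escape "&",
  // so decode any HTML entities first and re-escape the text for XML.
  $item = $channel->addChild('item');
  $item->addChild('title', htmlspecialchars(html_entity_decode($title), ENT_XML1));
  $item->addChild('description', htmlspecialchars(html_entity_decode($description), ENT_XML1));
  $item->addChild('link', htmlspecialchars($url_str, ENT_XML1));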
}

// Output the RSS feed
header('Content-Type: application/rss+xml; charset=utf-8');
echo $rss->asXML();
?>

This code will output an RSS feed that includes the title, description, and URL for each item. Note that the code assumes that the RSS feed is in the RSS 2.0 format, but you can adjust the code as necessary for other formats. Also, make sure to set the correct content-type header when outputting the RSS feed.

Limiting the RSS feed output

To limit the RSS feed to, say, the first 20 URLs in the sitemap, you can add a counter variable to the loop that iterates over the URLs, and use a break statement to exit the loop once 20 items have been added.

Here’s an updated version of the code snippet that limits the feed to the first 20 URLs:

<?php
// URL of the sitemap
$sitemap_url = "https://example.com/sitemap.xml";

// Load the sitemap XML into a SimpleXMLElement object
$sitemap_xml = new SimpleXMLElement(file_get_contents($sitemap_url));

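// Create a new RSS feed object. RSS 2.0 requires every channel to have
// a title, link, and description; the values below are placeholders.
$rss = new SimpleXMLElement('<rss version="2.0"></rss>');
$channel = $rss->addChild('channel');
$channel->addChild('title', 'Example feed');
$channel->addChild('link', 'https://example.com/');
$channel->addChild('description', 'Pages listed in the sitemap');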

// Set a counter variable to keep track of the number of URLs processed
$count = 0;

// Loop through each URL in the sitemap
foreach ($sitemap_xml->url as $url) {
  // Get the URL from the XML object
  $url_str = (string)$url->loc;

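  // Retrieve the HTML content of the page; skip pages that cannot be fetched
  $html = @file_get_contents($url_str);
  if ($html === false) {
    continue;
  }

  // Extract the title and meta description using regular expressions.
  // ?? '' avoids an undefined-index warning when a tag is missing.
  preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $title_matches);
  $title = trim($title_matches[1] ?? '');

  preg_match('/<meta\s+name="description"\s+content="(.*?)"/is', $html, $desc_matches);
  $description = trim($desc_matches[1] ?? '');

  // Create a new RSS feed item object. addChild() does not escape "&",
  // so decode any HTML entities first and re-escape the text for XML.
  $item = $channel->addChild('item');
  $item->addChild('title', htmlspecialchars(html_entity_decode($title), ENT_XML1));
  $item->addChild('description', htmlspecialchars(html_entity_decode($description), ENT_XML1));
  $item->addChild('link', htmlspecialchars($url_str, ENT_XML1));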

  // Increment the counter
  $count++;

  // Exit the loop after the first 20 URLs have been processed
  if ($count >= 20) {
    break;
  }
}

// Output the RSS feed
header('Content-Type: application/rss+xml; charset=utf-8');
echo $rss->asXML();
?>
