
SEO Keyword Difficulty in Python

Content marketing is popular these days, and for good reason—lots of companies are driving a significant portion of their revenue from these efforts. However, deciding what content to create is usually done in a vacuum, without considering the SEO competitiveness beforehand. Why spend all that time creating original content if Google will never rank it?

To help decide which keywords to target, let's create a Python script that checks how competitive a given keyword is, so we can get the best bang for our buck.

Who's Currently Ranking

To get an idea of how competitive a given keyword is, the first step is to check who currently ranks in the top spots. You could scrape that information from Google itself, but the results are likely to be personalized, and Google doesn't take kindly to scrapers anyway.

Instead, we'll grab this information from the SEMrush API:

def top_results(api_key, keyword):
    url = 'http://api.semrush.com/?type=phrase_organic&key=%s' % api_key
    url += '&display_limit=3&export_columns=Ur&database=us'
    url += '&phrase=%s' % urllib.quote_plus(keyword)

    response = requests.get(url)
    return response.text.splitlines()[1:]

This function grabs the top 3 Google US results for a given keyword and returns their URLs as a list.
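
As a rough sketch of what the function does, the snippet below separates URL construction from response parsing so that each half can be checked without an API key. The helper names and the sample response text are illustrative assumptions, not something taken from the SEMrush documentation. The only format detail relied on is the one top_results already assumes: a header line followed by one URL per line.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7


def build_phrase_organic_url(api_key, keyword, limit=3, database='us'):
    # Hypothetical helper: same query parameters as top_results,
    # but urlencode takes care of escaping the keyword.
    params = [
        ('type', 'phrase_organic'),
        ('key', api_key),
        ('display_limit', limit),
        ('export_columns', 'Ur'),
        ('database', database),
        ('phrase', keyword),
    ]
    return 'http://api.semrush.com/?' + urlencode(params)


def parse_result_urls(response_text):
    # Skip the header line, then keep every non-blank line as a URL.
    lines = response_text.splitlines()[1:]
    return [line.strip() for line in lines if line.strip()]


# Made-up response body, used only to exercise the parser.
sample = 'Url\nhttp://example.com/a\nhttp://example.org/b\n'
print(build_phrase_organic_url('KEY', 'semrush api'))
print(parse_result_urls(sample))
```

Keeping the parsing separate also makes it easy to see what happens when the API returns nothing: an empty body simply yields an empty list.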

Now that we have the top results, we will do a very simple analysis of each one: is the page actually targeting the keyword, and how many backlinks does it have?

Competitive Analysis

Generally, if you are trying to rank a page for a certain keyword, you will want to include that keyword in the page's title tag. This gives us a signal as to whether a page is ranking for our target keyword accidentally, or whether the webmaster is actively trying to rank for it.

To check if the keyword appears in the page's title, we download the page and check the title tag in a case-insensitive manner:

def keyword_in_title(url, keyword):
    response = requests.get(url)
    tree = fromstring(response.content)
    # Lowercase both sides; findtext returns None when there is no <title>
    title = tree.findtext('.//title') or ''
    return keyword.lower() in title.lower()
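
To try the title check without hitting the network, here is a minimal sketch that runs the same case-insensitive test against an HTML string. It deliberately swaps lxml for the standard library's HTMLParser so it runs with no extra dependencies, and it is written for Python 3. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def keyword_in_html_title(html, keyword):
    # Hypothetical offline counterpart of keyword_in_title:
    # lowercase both sides so the comparison ignores case.
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return keyword.lower() in parser.title.lower()


page = '<html><head><title>Docker Tutorial for Beginners</title></head></html>'
print(keyword_in_html_title(page, 'DOCKER'))
print(keyword_in_html_title('<html><body>no title here</body></html>', 'docker'))
```

A page with no title tag simply counts as not targeting the keyword, which matches the intent of the discount applied later in the score.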

The next step is to get the number of backlinks behind each current search result. If a result has thousands of backlinks, we assume it will be much harder to outrank. SEMrush again provides this info, although other data providers (like ahrefs or Majestic) would probably have more accurate counts. Note that the function below queries by hostname with target_type=domain, so the count covers the result's whole domain rather than the single URL.

def backlinks_count(api_key, result_url):
    parsed_url = urlparse.urlparse(result_url)
    url = 'http://api.semrush.com/analytics/v1/?key=%s' % api_key
    url += '&type=backlinks_overview&target_type=domain'
    url += '&target=%s' % parsed_url.hostname

    response = requests.get(url)
    return int(response.text.splitlines()[1].split(';')[0])
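
The two pieces of plain-Python work in this function, pulling the hostname out of a result URL and reading the first field of the first data row, can be sketched offline as below. The helper names, the header names in the sample text, and the error body are assumptions for illustration only. The sketch relies on the same layout the function above assumes: a header line, then a semicolon-separated row whose first field is the count.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def hostname_of(result_url):
    # The hostname is what gets sent as the 'target' parameter.
    return urlparse(result_url).hostname


def parse_backlink_total(response_text):
    # Hypothetical defensive parser: returns 0 when there is no data row
    # or when the first field is not a plain integer.
    lines = response_text.splitlines()
    if len(lines) < 2:
        return 0
    first_field = lines[1].split(';')[0].strip()
    return int(first_field) if first_field.isdigit() else 0


# Made-up response bodies; real column names may differ.
print(hostname_of('https://www.example.com/blog/post?x=1'))
print(parse_backlink_total('total;domains_num\n1382328;57\n'))
print(parse_backlink_total('ERROR'))
```

Falling back to 0 keeps the script running, but it also makes a keyword look easier than it may be, so raising an exception or returning None would be a reasonable alternative.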

Computing The Score

The competitiveness score we are calculating is relatively simple: take the average number of links from the top 3 results. If a given site does not have the keyword in its title, then discount its number of links by half.

If you were doing this analysis for real, you'd probably want to include other metrics here, like backlinks to the specific ranking page (backlinks_count looks up the whole domain) or social shares.

Here are the functions that score the results. The calculate_score function takes three lists as its arguments: the top 3 results, a list of booleans indicating whether each has the keyword in its title, and a list of ints giving the number of backlinks for each.

def calculate_score(results, kw_in_title, backlinks):
    zipped = zip(results, kw_in_title, backlinks)
    scored = [score(result) for result in zipped]

    # Average over the results actually returned (normally 3); 0 if none
    return sum(scored) // len(scored) if scored else 0

def score(result):
    url, targeted, links = result
    if targeted:
        return links
    else:
        return int(links * 0.5)
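
To make the arithmetic concrete, here is a small worked example with invented numbers. It restates the two functions compactly so the block runs on its own. The URLs and link counts are hypothetical.

```python
def score(result):
    url, targeted, links = result
    # Full credit if the keyword is in the title, half credit otherwise.
    return links if targeted else int(links * 0.5)


def calculate_score(results, kw_in_title, backlinks):
    scored = [score(r) for r in zip(results, kw_in_title, backlinks)]
    # Three results here, so this divides by 3 (integer division).
    return sum(scored) // len(scored)


urls = ['http://a.example/', 'http://b.example/', 'http://c.example/']
in_title = [True, False, True]
links = [1000, 400, 600]

# Per-result scores: 1000, 200 (400 halved), 600 -> sum 1800 -> average 600
print(calculate_score(urls, in_title, links))
```

The second result contributes only 200 of its 400 links because its title does not mention the keyword.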

Sample Results

The numbers by themselves don't really mean anything, but comparing them across candidate keywords makes it clear which will be easiest to rank for (lower is easier):

semrush api      1382328
docker          65749034
google        1094447212
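
Since only the relative order matters, one way to use these numbers is to sort candidate keywords from lowest to highest score. The sketch below does that with the three sample scores above. The function name is made up for this example.

```python
# Scores copied from the sample results above.
scores = {
    'semrush api': 1382328,
    'docker': 65749034,
    'google': 1094447212,
}


def rank_keywords(scores):
    # Lowest score first: fewer backlinks to beat among the top results.
    return sorted(scores.items(), key=lambda item: item[1])


for keyword, value in rank_keywords(scores):
    print('%-12s %12d' % (keyword, value))
```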

The Full Script

The external dependencies are requests and lxml. Tested on Python 2.7.

import urllib
import urlparse

from lxml.html import fromstring
import requests

SEMRUSH_API_KEY = 'YOUR-SEMRUSH-KEY'

def top_results(api_key, keyword):
    url = 'http://api.semrush.com/?type=phrase_organic&key=%s' % api_key
    url += '&display_limit=3&export_columns=Ur&database=us'
    url += '&phrase=%s' % urllib.quote_plus(keyword)

    response = requests.get(url)
    return response.text.splitlines()[1:]

def backlinks_count(api_key, result_url):
    parsed_url = urlparse.urlparse(result_url)
    url = 'http://api.semrush.com/analytics/v1/?key=%s' % api_key
    url += '&type=backlinks_overview&target_type=domain'
    url += '&target=%s' % parsed_url.hostname

    response = requests.get(url)
    return int(response.text.splitlines()[1].split(';')[0])

def keyword_in_title(url, keyword):
    response = requests.get(url)
    tree = fromstring(response.content)
    # Lowercase both sides; findtext returns None when there is no <title>
    title = tree.findtext('.//title') or ''
    return keyword.lower() in title.lower()

def calculate_score(results, kw_in_title, backlinks):
    zipped = zip(results, kw_in_title, backlinks)
    scored = [score(result) for result in zipped]

    # Average over the results actually returned (normally 3); 0 if none
    return sum(scored) // len(scored) if scored else 0

def score(result):
    url, targeted, links = result
    if targeted:
        return links
    else:
        return int(links * 0.5)

if __name__ == '__main__':
    keyword = 'YOUR KEYWORD'

    results = top_results(SEMRUSH_API_KEY, keyword)
    kw_in_title = [keyword_in_title(url, keyword) for url in results]
    backlinks = [backlinks_count(SEMRUSH_API_KEY, url) for url in results]

    print(calculate_score(results, kw_in_title, backlinks))