Using TF*IDF Analysis To Write Clearly Defined Content


If you’re reading this post then you are either 1) familiar with the term or 2) have absolutely no idea what it means and you’re twisting your brain to understand how it’s relevant to SEO.

If you find yourself sitting in the number 2 category, then this is a fantastic opportunity for you to hop on board the TF*IDF wagon and start utilising this analysis as part of your content recommendation plan and roadmap. When I was first introduced to the process, it excited me. I mean, it really excited me. I took the idea and then developed my own tool that automated the process for me.

Occasionally in SEO you find a golden goose, something that just WORKS, and I believe this strategy, for me, is one of them. It’s a combination of science and art, as content needs to be relevant for both search engines and users, right? This TF*IDF process allows us to unravel sets of content to understand their intent and why Google deems a specific piece of content to have a great user experience attached to it, hence why it has a high position. We can then use the sets of content (one word, two words, three words) to identify frequent queries, synonyms, content gaps and silos that create a specific intent path. Is it an informational intent? Is it a transactional intent? You’ll be able to get to grips with this from the queries, synonyms and adjectives used in the content, which feed a numerical statistic that determines how important a specific word or phrase is within the document. You can take these learnings to improve your own content.

You’re probably thinking … ah, it’s just keyword density! NO.

It’s by no means a magic cure, and I’m sure there are other folks out there that have a better system or process to calculate query importance at a much higher level, but I believe this is worthy of a post as I have learnt a lot during this process. The days of shoving keywords into a piece of content and hoping to rank are gone. In order to make a difference, you need a formula that works not only for crawlers but for users, too.

It’s by no means a magic cure. However, if used properly, it can certainly be the holy grail for search marketers.

What is TF*IDF?

Good question, and something I should’ve said earlier. TF*IDF stands for Term Frequency multiplied by Inverse Document Frequency. It is a numerical statistic that can be used to determine how important a specific word or phrase is within a set of content. The analysis uses the keyword density metric to determine query frequency, then takes this data to calculate various other metrics that can be used to determine the weight of importance for a specific query. We will dive into these a little later.

[Image: TF-IDF Sheet]

TF*IDF can be executed in many ways and there are some tools available. However, I decided to go away and use these formulas to create my own automated tool. I use SEO Tools for Excel to scrape the search results for the top 5 competitor sites for a given query and extract their query frequencies via keyword density, which then gives my system the ammo to execute the TF*IDF process.

It sounds technical, but it really isn’t. Let’s dive into the process.

TF*IDF Metrics and Terminologies

There are three steps to calculating the TF*IDF for a given query:

    • Calculating query frequencies and occurrences;
    • Inverse document frequencies (IDF) for each query across positions;
    • TF ratios, averages, flags and competitive gaps.

Let’s dive into them…

1) Calculating query frequencies

This phase involves pulling keywords from a page along with their respective frequencies, just like keyword density. However, I prefer to grab all of the one-word, two-word and three-word phrases that are referenced. This allows us to identify the differences in intent between singular queries and phrases.

I tend to use a tool like SEObook Keyword Density to gather this information. Very recently, however, I have been using SEO Tools for Excel.
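For anyone without either tool to hand, the one-, two- and three-word counts described above can be approximated in a few lines of Python. This is only a rough sketch: the `ngram_frequencies` name, the tokenising rule and the sample sentence are my own assumptions, and real keyword density tools may strip stop words or handle punctuation differently.

```python
import re
from collections import Counter

def ngram_frequencies(text, max_n=3):
    """Count every one-, two- and three-word phrase in a block of text."""
    # Lowercase and keep only letters, digits and apostrophes,
    # so that "TFIDF," and "tfidf" are counted as the same word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = {}
    for n in range(1, max_n + 1):
        # Slide a window of n words across the page.
        # Note: phrases are allowed to span sentence boundaries here.
        grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        counts[n] = Counter(grams)
    return counts

sample = "TFIDF analysis helps content. TFIDF analysis is not keyword density."
freqs = ngram_frequencies(sample)
print(freqs[1].most_common(2))      # most frequent single words
print(freqs[2]["tfidf analysis"])   # frequency of one two-word phrase
```

Each `Counter` can then be pasted into a spreadsheet column per URL, one for each ranking position.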

[Screenshot: TFIDF Query Frequencies]

As you can see in the screenshot I have attached, I have inputted all of the queries into a spreadsheet, along with their frequencies for positions 1-5 and for the page we’re looking to assess on our website. Additionally, to aid the term frequency formula, I have calculated the total number of keyword instances per position by specifying this in URL1, URL2, URL3 and so forth at the top.

The term frequency calculation used is as follows:

tf(t in d) = √frequency
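As a quick illustration of what the square root does, here is the same formula as a small Python function (the `term_frequency` name is just my own label). It suggests why repeating a query over and over gives diminishing returns: quadrupling the mentions only doubles the score.

```python
import math

def term_frequency(frequency):
    """tf(t in d) = sqrt(frequency): the square root dampens repeated use."""
    return math.sqrt(frequency)

# Each step quadruples the raw count, but the score only doubles.
for count in (1, 4, 16, 64):
    print(count, term_frequency(count))
```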

2) Inverse document frequencies

This is probably the most complex set of formulas. It builds the system’s foundation and makes the whole TFIDF process achievable. Luckily for us, the formulas have been in place from the very start thanks to Stone Temple Consulting, who provided the TFIDF spreadsheet with all of the Excel formulas built in. In our version, however, we expanded on these.

As we have the term frequencies from the keyword density step, our system then needs to calculate how often each term appears in the document, whilst also calculating its frequency across the content collection as a whole (aka all sites within the analysis). The more documents a specific word or phrase appears in, the lower the weight of that query or set of queries, as it’s not unique.

As an example, if we were to work out the inverse document frequency of words such as “the”, “and”, “then” and “yes”, we would find that all of them are commonly used across the content sets of every site. The words therefore aren’t unique and yield a lower inverse document frequency score.

If a webpage has referenced words like “TFIDF”, “search engine optimisation” or “localisation”, these will help us identify the most interesting documents with lots of unique content. It’s more likely for a site in first or second position to have more unique words and phrases than the rest of the content sets, as they offer a new and unique experience to the user, whilst mentioning and referencing queries that match the user’s intent. Hence why they’re at the top of Google.

Here is a screenshot outlining the terms with low TF ratios. It shows the top queries that our competitors are using, which are mostly unique, and the phrases that all of our competitors are using and we aren’t. This is an excellent opportunity to discount some of these queries by referencing them in our own content, and to ensure we use the same synonyms and vocabulary to match the same user intent. This is important because these are the sites that are currently positioned at the top of Google, so they must be doing something right, eh? Keywords are blurred to protect clients’ data.

[Screenshot: TFIDF Low Ratio & Competitor Gap Opportunities]

The trick is then to reference the unique keywords that your competitors are using, and the words that we have identified as part of this analysis, to ensure that they are no longer unique. Whilst doing this, you can also optimise your content to include words and phrases that your competitors aren’t using, so, during this process, you’re discounting their content whilst improving your own.

The inverse document frequency is calculated as follows:

idf(t) = 1 + log ( numDocs / (docFreq + 1))
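To see how this behaves, here is a minimal Python sketch of the formula. The formula above doesn’t say which logarithm base is intended, so I have assumed the natural log; the function name and the five-page collection are also just illustrative assumptions.

```python
import math

def inverse_document_frequency(num_docs, doc_freq):
    """idf(t) = 1 + log(numDocs / (docFreq + 1)), natural log assumed."""
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 5  # e.g. the five ranking pages in the analysis
# doc_freq is the number of those pages that mention the term.
for doc_freq in (5, 3, 1):
    print(doc_freq, round(inverse_document_frequency(num_docs, doc_freq), 3))
```

A term found on every page (like “the”) gets the lowest score, while a term found on only one page gets the highest, which matches the uniqueness idea described earlier.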

TF*IDF Example In Action

It’s easy to explain these things in lots of gobbledygook words. However, if we were to put this into practice, without the use of a system or spreadsheet, what would the formulas look like?

Good question. Let’s assume that we’re conducting some content analysis for a blog post that our writers are compiling, and the post is about ‘TFIDF’.

I know, I’m not that creative, sorry.

For the sake of this demonstration, let’s assume a web page has, on average, 1000 words, wherein the term ‘TFIDF’ appears 50 times.

The term frequency for the phrase ‘TFIDF’, taken here as a simple share of the words on the page, would therefore be (50 / 1000) = 0.05.

If we were conducting this analysis at scale, and we had 50,000 separate documents (i.e. websites), and the term ‘TFIDF’ appeared in 5,000 of these, then the inverse document frequency, taken here as a simple ratio without the logarithm, would be (50,000 / 5,000) = 10.

Therefore, the term frequency and inverse document frequency weight would be 0.05 * 10 = 0.5.
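The worked example uses plain ratios, whereas the formulas quoted earlier use a square root for tf and a logarithm for idf. The sketch below runs the same numbers through both so you can see the difference. The variable names are my own and the natural log is an assumption; the two weights sit on different scales, so only compare scores calculated the same way.

```python
import math

words_on_page = 1000
term_count = 50
num_docs = 50_000
doc_freq = 5_000

# Simplified version from the worked example: plain ratios, no sqrt or log.
simple_tf = term_count / words_on_page      # 0.05
simple_idf = num_docs / doc_freq            # 10.0
simple_weight = simple_tf * simple_idf      # 0.5

# Same inputs through the formulas quoted earlier (natural log assumed).
formula_tf = math.sqrt(term_count)
formula_idf = 1 + math.log(num_docs / (doc_freq + 1))
formula_weight = formula_tf * formula_idf

print(simple_tf, simple_idf, simple_weight)
print(round(formula_tf, 3), round(formula_idf, 3), round(formula_weight, 3))
```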

How You Can Use TF*IDF

TFIDF can be used in a multitude of ways. If you’re a creative person, you will most certainly think of your own ways of making the best use of this data. I tend to calculate the term frequencies and inverse document frequencies at scale, and then:

  • Identify my competitors’ unique words and phrases to replicate, in order to discount them;
  • Improve on important queries that may have a low ratio by referencing them more often;
  • Identify queries that all of my top competitors are using and I am not;
  • Get an idea of the ‘type’ of content that Google deems a great user experience, and a deep insight into the synonyms, silos and vocabulary that these top-end pages are using.
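The first and third bullets above boil down to a set comparison, which can be sketched in Python as follows. The `competitor_gaps` function and the sample terms are hypothetical and only there to show the idea; in practice the term sets would come from your keyword density data.

```python
from collections import Counter

def competitor_gaps(our_terms, competitor_pages):
    """Find terms we are missing: those every competitor uses, and those only one uses."""
    # Count how many competitor pages use each term.
    usage = Counter(term for page in competitor_pages for term in set(page))
    # Keep only the terms that our own page does not reference.
    missing = {term: n for term, n in usage.items() if term not in our_terms}
    used_by_all = sorted(t for t, n in missing.items() if n == len(competitor_pages))
    unique_to_one = sorted(t for t, n in missing.items() if n == 1)
    return used_by_all, unique_to_one

ours = {"tfidf", "keyword density"}
competitors = [
    {"tfidf", "term frequency", "user intent"},
    {"tfidf", "term frequency", "content gaps"},
    {"tfidf", "term frequency", "keyword density"},
]
used_by_all, unique_to_one = competitor_gaps(ours, competitors)
print(used_by_all)     # queries all top competitors use and we do not
print(unique_to_one)   # unique competitor queries we could reference
```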

By using the TFIDF process, you can tap into the sets of content being used by those ranking at the top end of search to understand user intent, the ‘type’ of content sets that Google prefers to rank for that particular query, the words and phrases they are using and the unique sets of phrases that are being used.

In this process, you can replicate your competitors’ content, add to your own and ensure that you’re maximising the potential from an on-site content SEO perspective. You’ll also look like an SEO guru when you explain the data-driven process to the content team, rather than simply asking them to reference certain queries in the hope of ranking.

Why TF*IDF?

Another good question. You may wonder whether TF*IDF is legit: is Google using TF*IDF within its ranking algorithms to understand intent? And what about Panda?

It’s clear that Google is using a set of formulas far more complex than a simple TF*IDF process to calculate the importance of queries. If it were as simple as this, I’m sure we’d all be at the top of Google by now. However, Google has hinted that it has been used in indexing. I believe TF*IDF is something we can use to enhance our learnings, and it is far more useful than a simple keyword density model.

Google has previously said on its blog:

“One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581 word article, and compare that to the usual frequency of “coach” — more like 5 in 330,000 words — you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea of the famous TFIDF, long used to index web pages.”
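The comparison in that quote is simple arithmetic, and a short Python sketch makes it concrete. The `overuse_ratio` function name is my own; the numbers are the “coach” figures from the quote.

```python
def overuse_ratio(count_in_doc, doc_length, usual_count, usual_length):
    """How many times more often a word appears in a document than its usual rate."""
    rate_in_doc = count_in_doc / doc_length
    usual_rate = usual_count / usual_length
    return rate_in_doc / usual_rate

# "coach": 5 times in a 581 word article, versus 5 in 330,000 words normally.
coach = overuse_ratio(5, 581, 5, 330_000)
print(round(coach))
```

By that measure, “coach” shows up several hundred times more often in the article than in ordinary text, which is the signal that the article is about coaching.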

Great, How Do I Get Started With TF*IDF?

For starters, we don’t own this spreadsheet. However, a great place to start is by downloading the TF*IDF spreadsheet from Stone Temple Consulting on our downloads page.

This sheet contains all of the formulas to get you going on your TFIDF journey. However, you may decide to tailor it to suit your needs.

In time, once you get to grips with the spreadsheet, you may decide to implement your own methods, but it’s an excellent starting point.

Below are the steps to correctly set up this TFIDF spreadsheet:

1) Run all of your URLs through a keyword density tool to collect the query data and their respective frequencies. These will need to be populated into a separate tab;

2) Copy and paste all of the one-word, two-word and three-word results into this tab;

3) Sort the data to ensure the frequencies are properly organised;

4) Set up IF statements in the main sheet to populate the ‘Keywords’ field with all of the unique entries that are contained in the second sheet. This formula will, in essence, populate the system with the queries that are being used across all of the sites in the TFIDF process;

5) Set up VLOOKUPs in the Position 1 – 5 cells to count the number of times a particular query was referenced. This formula will be responsible for calculating frequencies;

6) Then crack on with some analysis! The rest of the spreadsheet should automatically populate, thanks to Stone Temple Consulting, as they have added all of the TF/IDF formulas already.
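If you’d rather prototype outside of Excel, steps 4 and 5 can be roughly mirrored in Python. This is not how the Stone Temple Consulting spreadsheet works internally; it’s just a sketch of the same idea, and the `density_by_url` data and its keys are made up for illustration.

```python
# Hypothetical keyword density output: one {query: frequency} dict per page.
density_by_url = {
    "position_1": {"tfidf": 12, "term frequency": 5},
    "position_2": {"tfidf": 8, "user intent": 3},
    "our_page": {"tfidf": 4},
}

# Step 4 equivalent: every unique query across all pages, sorted.
keywords = sorted({q for freqs in density_by_url.values() for q in freqs})

# Step 5 equivalent: look up each query's frequency per page, 0 when absent
# (the job the VLOOKUPs do in the spreadsheet).
table = {
    query: {url: freqs.get(query, 0) for url, freqs in density_by_url.items()}
    for query in keywords
}

# Total keyword instances per page, like the URL1, URL2, ... cells at the top.
totals = {url: sum(freqs.values()) for url, freqs in density_by_url.items()}

for query, row in table.items():
    print(query, row)
print(totals)
```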

About the author

Brett Saltalamacchia

Brett has a wealth of search experience, with 5 years under his belt working in-house for major global brands in the adult and financial sectors. He's also worked agency side so he's been exposed to every vertical and search classification.
