Analyze and Visualize a Retweet Network

Twitter networks are a great source of information about how people communicate and connect. Fortunately, Twitter makes much of its data freely available, which you can read about in its developer documentation. This template queries retweet data and then plots it in an interactive visualization to understand conversations around a hashtag of your choice.

To use this template, you will need the following:

An active Twitter account.
A bearer token for accessing the Twitter API.

To get a bearer token, you will need to navigate to this page and sign up for Essential access. This will take you through a short verification process. When you are finished, you should be able to create a new app and generate a bearer token which will be used to access the API.

Warning: This template will extract real Twitter data. As a result, some content may contain offensive language.

1. Getting Set Up

In order to access the Twitter API, you will need to use an integration to set an environment variable. To add a new integration in your Workspace, click on the Integrations icon in the far left toolbar of the Workspace editing interface. Next, click "Add Integration" and "Environment Variables". You will need to specify the name (BEARER_TOKEN) and the value (the token you were provided). You can call this "Twitter Integration". You can read more about integrations here. Click "Create" and follow the remaining steps, and you should be ready to go!

The code then performs the following:

Installs and imports the packages you will use to retrieve Twitter data and visualize it.
Sets your bearer token for accessing the Twitter API. This does not require further input if you have configured your BEARER_TOKEN environment variable correctly.
Sets the hashtag you want to explore. By default, this template analyzes a retweet network based on the hashtag "#python". You are free to supply any hashtag you wish to use (a topic preceded by a # symbol).
Initializes a tweepy Client and, as a test, retrieves the ten most recent tweets for your hashtag.

%%capture
# Install necessary packages
!pip install tweepy

# Import packages
import os
import tweepy
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import networkx as nx
import plotly.graph_objects as go

# Set bearer_token for essential access
bearer_token = os.environ["BEARER_TOKEN"]

# Set the hashtag you wish to explore
hashtag = "#python"

# Initialize the Tweepy client
client = tweepy.Client(bearer_token=bearer_token)

# Confirm the client is initialized by printing the 10 most recent tweets using your hashtag
for tweet in client.search_recent_tweets(hashtag).data:
    print(tweet.text)

The code above should return the text of the 10 most recent tweets using the hashtag you supplied. If you have not set up your integration correctly (or are using the wrong bearer token), you may encounter an error such as:

Unauthorized: 401 Unauthorized

If you do encounter such an error, make sure to review the instructions and try again.
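Before retrying, it can help to confirm that the environment variable is actually visible to the notebook. The following is a minimal sketch and not part of the original template: the helper name `read_bearer_token` is hypothetical, and it only checks that the variable is present and non-empty, not that Twitter accepts the token.

```python
import os


def read_bearer_token(env=os.environ, name="BEARER_TOKEN"):
    """Return the token stored under `name`, or raise a descriptive error.

    `read_bearer_token` is a hypothetical helper for illustration only.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {name} is missing or empty. "
            "Check the integration described in section 1."
        )
    return token


# Example with a stand-in mapping instead of the real environment
print(read_bearer_token({"BEARER_TOKEN": "example-token"}))
```

If this raises a RuntimeError, the problem is with the integration rather than with the token itself.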

2. Create a DataFrame of Retweets

Next, you can use the client to retrieve a specified number of tweets related to a topic. The code below defines a custom function that uses Paginator() to return batches of recent tweets (within the past seven days) about a specific topic. There are three parameters you can customize:

The hashtag you want to query.
The num_results you want to return in total. The number must be a multiple of 100 and cannot exceed 2000, because tweets are fetched in batches of 100.
The language (lang) of the tweets you query. The language is set to English by default, but you can use other languages if you prefer!

Note: Depending on the number of results you return, this code can take some time to execute.
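To make the relationship between these parameters concrete, here is a small sketch of how they could translate into a search query and a batch count. The helper `plan_query` is hypothetical and makes no API calls; it only mirrors the limits described above.

```python
def plan_query(hashtag, num_results=1000, lang="en", max_results=100):
    """Return the search query string and the number of batches to request.

    `plan_query` is a hypothetical helper; the limits mirror the ones
    described above (multiples of 100, at most 2000).
    """
    if num_results > 2000:
        raise ValueError("`num_results` must be less than or equal to 2000.")
    if num_results % max_results != 0:
        raise ValueError("`num_results` must be a multiple of 100.")
    # Restrict the search to retweets in the chosen language
    query = f"{hashtag} is:retweet lang:{lang}"
    # Integer division keeps the batch count an int
    batches = num_results // max_results
    return query, batches


query, batches = plan_query("#python", num_results=300)
print(query)    # #python is:retweet lang:en
print(batches)  # 3
```

Requesting 300 results therefore means three batches of 100 tweets each.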

# Define a function to query tweets
def get_retweets(hashtag, num_results=1000, lang="en"):
    # Initialize an empty DataFrame to store user data and tweets
    tweets_df = pd.DataFrame()
    
    # Validate num_results and work out how many batches of 100 to request
    if num_results > 2000:
        raise ValueError("`num_results` must be less than or equal to 2000.")
    elif num_results % 100 != 0:
        raise ValueError("`num_results` must be a multiple of 100.")
    max_results = 100
    limit = num_results // max_results  # Integer number of batches

    # Iterate through batches of tweets
    for tweet_batch in tweepy.Paginator(
        client.search_recent_tweets,
        query=hashtag + " is:retweet lang:" + lang,
        max_results=max_results,
        limit=limit,
        expansions=["author_id"],
        user_fields=["username", "id"],
    ):
        # Retrieve data and user data from batch and add it to DataFrame
        batch_data = pd.DataFrame(tweet_batch.data)
        users = {u["id"]: u["username"] for u in tweet_batch.includes["users"]}
        batch_data["retweeter"] = batch_data["author_id"].map(users)
        # Concatenate temporary DataFrames to existing DataFrames
        tweets_df = pd.concat([tweets_df, batch_data])

    # Extract the original tweeter (the first @mention) from the tweet text
    tweets_df["tweeter"] = tweets_df["text"].str.extract(r"@(\w+)")
    # Return DataFrame
    return tweets_df


# Create a DataFrame using the function defined above
df = get_retweets(
    hashtag,  # The hashtag you supplied at the beginning
    num_results=1000,  # The maximum results to return
)

# Preview the DataFrame
df
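The "tweeter" column depends on the regular expression `@(\w+)`, which captures the first @mention in each tweet. The sketch below applies the same pattern to a few made-up strings using only the standard library. It assumes retweet text begins with "RT @username:", and the sample tweets and the helper `first_mention` are hypothetical.

```python
import re

# Same pattern the function above passes to str.extract
MENTION = re.compile(r"@(\w+)")

# Hypothetical tweet texts for illustration
samples = [
    "RT @alice: Five #python tips",
    "RT @bob_dev: thanks @alice for the #python talk",
    "No mention here #python",
]


def first_mention(text):
    """Return the first @mentioned username in `text`, or None if there is none."""
    match = MENTION.search(text)
    return match.group(1) if match else None


tweeters = [first_mention(text) for text in samples]
print(tweeters)  # ['alice', 'bob_dev', None]
```

Only the first mention is kept, so a later @mention inside the tweet body is ignored, and text without any mention yields no tweeter (a missing value in the DataFrame).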

3. Calculating the Importance of Users in the Network

The next step is to analyze the network. The extracted tweets and relevant information are returned as a DataFrame containing an 'edge list', or a list of all edges between nodes. The code then calculates three measures of centrality for each user in the network.

In-degree centrality represents the number of edges going into a node. In the case of retweets, a high in-degree centrality indicates that a user is getting a large number of retweets.
Out-degree centrality represents the number of edges going out of a node. In the case of retweets, a high out-degree centrality indicates that a user is retweeting a lot.
Betweenness centrality represents the number of 'shortest paths' between nodes that pass through a specific node. In the case of tweets, it measures the extent to which a user connects otherwise separate communities of users.

The code also stores the number of in_degrees (tweets) and out_degrees (retweets) to visualize later with Plotly.
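To see what the degree-based measures compute, here is a hand-rolled sketch on a tiny, made-up edge list. It divides each degree by the number of other nodes (n - 1), which is the normalization NetworkX applies for degree centrality. The variable names are chosen to avoid clashing with the template's own, and the edge list is assumed to contain no duplicate pairs.

```python
from collections import Counter

# Hypothetical edge list: (retweeter, original tweeter)
sample_edges = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("bob", "carol"),
]

# Every user that appears on either side of an edge is a node
sample_nodes = {user for edge in sample_edges for user in edge}

# In-degree: how often a user was retweeted; out-degree: how often they retweeted
in_counts = Counter(tweeter for _, tweeter in sample_edges)
out_counts = Counter(retweeter for retweeter, _ in sample_edges)

# Normalize by the number of other nodes in the graph
norm = len(sample_nodes) - 1
in_centrality_sketch = {user: in_counts[user] / norm for user in sample_nodes}
out_centrality_sketch = {user: out_counts[user] / norm for user in sample_nodes}

print(in_centrality_sketch["alice"])  # 1.0
print(out_centrality_sketch["bob"])
```

Here alice is retweeted by all three other users, so her in-degree centrality is 1.0, while bob retweets two of the three other users.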

# Create a directed network graph from the DataFrame
G = nx.from_pandas_edgelist(df, "retweeter", "tweeter", create_using=nx.DiGraph())

# Calculate the in-degree centrality for retweets
in_centrality = nx.in_degree_centrality(G)

# Calculate the total number of tweets for later plotting purposes
in_degrees = dict(G.in_degree())

# Store centralities in DataFrame
popular_tweeters = pd.DataFrame(
    list(in_centrality.items()), columns=["username", "in-degree_centrality"]
)

# Print the most important users by their centrality
popular_tweeters.sort_values("in-degree_centrality", ascending=False).head()

# Calculate the out-degree centrality for retweets
out_centrality = nx.out_degree_centrality(G)

# Calculate the total number of retweets for later plotting purposes
out_degrees = dict(G.out_degree())

# Store centralities in DataFrame
active_retweeters = pd.DataFrame(
    list(out_centrality.items()), columns=["username", "out-degree_centrality"]
)

# Print the most important users by the amount they retweet
active_retweeters.sort_values("out-degree_centrality", ascending=False).head()

# Create an undirected network graph from the DataFrame
G = nx.from_pandas_edgelist(df, "retweeter", "tweeter")

# Calculate the betweenness centrality for retweets
betweenness = nx.betweenness_centrality(G)

# Store centralities in DataFrame
bridging_users = pd.DataFrame(
    list(betweenness.items()), columns=["username", "betweenness"]
)

# Print the most important users by how much they bridge other users
bridging_users.sort_values("betweenness", ascending=False).head()

Try comparing this list of users with high betweenness to the list of those with a high in-degree centrality. Are the users with the most retweets also connecting different communities together?
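To build intuition for why a user with few connections can still score highly on betweenness, the sketch below computes it from scratch on a made-up graph of two small communities joined by a single "bridge" user. It uses Brandes' algorithm for an unweighted, undirected graph and normalizes by the number of possible node pairs, which is assumed to match the default behavior of `nx.betweenness_centrality`. The function name and the sample graph are hypothetical.

```python
from collections import deque


def betweenness_sketch(adjacency):
    """Normalized betweenness for an unweighted, undirected graph (Brandes)."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`
        order = []
        predecessors = {node: [] for node in adjacency}
        path_counts = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_counts[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            order.append(current)
            for neighbor in adjacency[current]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[current] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[current] + 1:
                    path_counts[neighbor] += path_counts[current]
                    predecessors[neighbor].append(current)
        # Walk back from the farthest nodes, accumulating dependencies
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_counts[pred] / path_counts[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    n = len(adjacency)
    # Each unordered pair was counted twice; dividing by (n-1)(n-2) both
    # halves the count and normalizes by the number of possible pairs
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: score * scale for node, score in scores.items()}


# Hypothetical graph: two communities joined by one "bridge" user
sample_graph = {
    "a1": ["hub_a"], "a2": ["hub_a"],
    "hub_a": ["a1", "a2", "bridge"],
    "bridge": ["hub_a", "hub_b"],
    "hub_b": ["bridge", "b1", "b2"],
    "b1": ["hub_b"], "b2": ["hub_b"],
}
result = betweenness_sketch(sample_graph)
print(round(result["bridge"], 3), round(result["a1"], 3))  # 0.6 0.0
```

The bridge user has only two connections, fewer than either hub, yet scores as highly as the hubs because every shortest path between the two communities passes through it.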

4. Visualizing the Retweet Network

The code below creates an interactive network visualization using Plotly. There are a number of different attributes to the plot worth noting:

The nodes are colored by the number of retweeted tweets. Those with a large number of retweeted tweets are colored blue and purple, and those who only retweet are colored yellow.
Those with more connections in total are larger to make it easier to spot them.
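The colors and sizes come from two small transformations: a log transform of the retweeted-tweet counts, and a linear rescaling of the connection counts into a fixed size range. The sketch below reproduces that arithmetic in plain Python so the steps are visible. The input numbers and the helper `min_max_scale` are made up for illustration; the template itself uses NumPy and scikit-learn for these steps.

```python
import math

retweeted_counts = [0, 1, 10, 100]  # hypothetical in-degrees (drive color)
connection_counts = [1, 2, 5, 21]   # hypothetical total connections (drive size)

# Log-transform the color values, flooring zeros at a tiny positive number
log_colors = [math.log10(max(count, 1.0e-10)) for count in retweeted_counts]


def min_max_scale(values, low=5, high=30):
    """Linearly rescale `values` so the smallest maps to `low` and the largest to `high`."""
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        # All values identical: give every node the smallest size
        return [float(low) for _ in values]
    return [low + (value - smallest) * (high - low) / span for value in values]


sizes = min_max_scale(connection_counts)
print([round(color, 2) for color in log_colors])  # [-10.0, 0.0, 1.0, 2.0]
print(sizes)  # [5.0, 6.25, 10.0, 30.0]
```

Note that a user with zero retweeted tweets lands at -10 on the log scale, far below everyone else, which is why users who only retweet stand out so sharply in color.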

# Create the graph and specify the layout of the graph
G = nx.from_pandas_edgelist(df, "retweeter", "tweeter")
pos = nx.drawing.layout.spring_layout(G)
nx.set_node_attributes(G, pos, "pos")

# Create a dictionary mapping each node's position in G.nodes() to its username
nodes_dict = {index: node for (index, node) in enumerate(G.nodes())}

# Gather edge positions for visualization
edge_x = []
edge_y = []
for edge in G.edges():
    x0_point, y0_point = G.nodes[edge[0]]["pos"]
    x1_point, y1_point = G.nodes[edge[1]]["pos"]
    edge_x.append(x0_point)
    edge_x.append(x1_point)
    edge_x.append(None)
    edge_y.append(y0_point)
    edge_y.append(y1_point)
    edge_y.append(None)

# Add edges as disconnected lines
edge_trace = go.Scatter(
    x=edge_x,
    y=edge_y,
    line=dict(width=0.5, color="#000000"),
    hoverinfo="none",
    mode="lines",
)

# Gather node positions for visualization
node_x = []
node_y = []
for node in G.nodes():
    x, y = G.nodes[node]["pos"]
    node_x.append(x)
    node_y.append(y)

# Iterate through the nodes and store the usernames and tweeting information
node_text = []
node_adjacencies = []
node_sizes = []
node_colors = []

# Iterate through the nodes and create the text to be shown on hover
for node_number, adjacencies in enumerate(G.adjacency()):
    node_text.append(  # Set the text to be shown on hover
        "Username: "
        + str(adjacencies[0])  # The username
        + "<br>"
        + "Number of Connections: "
        + str(len(adjacencies[1]))  # The number of connections
        + "<br>"
        + "Number of Retweeted Tweets: "
        + str(in_degrees[nodes_dict[node_number]])
        + "<br>"
        + "Number of Retweets: "
        + str(out_degrees[nodes_dict[node_number]])
    )
    node_adjacencies.append(len(adjacencies[1]))
    node_sizes.append(len(adjacencies[1]))
    node_colors.append(in_degrees[nodes_dict[node_number]])

# Log transform the color list for visualization purposes
node_colors = np.array(node_colors)
node_colors_temp = np.where(node_colors > 1.0e-10, node_colors, 1.0e-10)
node_colors_log = np.log10(node_colors_temp)

# Scale the size of the nodes between two values
scaler = MinMaxScaler(feature_range=(5, 30))
node_sizes = scaler.fit_transform(np.array(node_sizes).reshape(-1, 1))

# Plot the nodes
node_trace = go.Scatter(
    x=node_x,
    y=node_y,
    mode="markers",
    hoverinfo="text",
    marker=dict(
        showscale=True,
        colorscale="Plasma",  # For more colorscale options, go here: https://plotly.com/python/builtin-colorscales/
        reversescale=True,
        color=[],
        size=10,
        colorbar=dict(
            thickness=15,
            title=dict(
                text="Number of Retweeted Tweets (Log Transformed)",
                font=dict(size=12),
                side="right",
            ),
            xanchor="left",
            tickvals=[min(node_colors_log), max(node_colors_log)],
            ticktext=["Low", "High"],
        ),
        line=dict(color="Black", width=0.5),
    ),
)

# Set the size, color, and text of the nodes
node_trace.marker.size = node_sizes
node_trace.marker.color = node_colors_log
node_trace.text = node_text

# Customize and display the figure
fig = go.Figure(
    data=[edge_trace, node_trace],
    layout=go.Layout(
        title="<b>Retweet Network Graph for "
        + str(hashtag)
        + "</b>",  # Set your title here
        title_x=0.5,
        title_font_size=16,
        showlegend=False,
        paper_bgcolor="#e6e6e6",  # Set the background color (excluding the plot) here
        plot_bgcolor="#e6e6e6",  # Set the plot background color here
        font_color="#000000",
        hovermode="closest",
        margin=dict(b=20, l=5, r=5, t=40),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        width=800,  # Adjust the width of the plot
        height=500,  # Adjust the height of the plot
    ),
)
fig.show()

Be sure to explore the plot and examine the network. In particular, try the following:

Hover over individual nodes to see their username, the number of connections they have, the number of retweeted tweets, and the number of retweets. Are there any names you already know?
Use the cursor to draw boxes and zoom in on particular regions of the plot. You can always double-click to zoom back out!
Inspect how the larger nodes with more connections interact with other larger nodes.

Note: Although there are many lines of code customizing the interactive visualization, you can run this code as-is without further modification. However, if you are interested, try adjusting things like colors, text, etc.