Analyze and Visualize a Retweet Network

    Twitter networks are a great source of information about how people communicate and connect. Fortunately, Twitter makes much of its data freely available, which you can read about in its developer documentation. This template queries retweet data and then plots it in an interactive visualization to understand conversations around a hashtag of your choice.

    To use this template, you will need:

    • An active Twitter account.
    • A bearer token for accessing the Twitter API.

    To get a bearer token, you will need to navigate to this page and sign up for Essential access. This will take you through a short verification process. When you are finished, you should be able to create a new app and generate a bearer token which will be used to access the API.

    Warning: This template will extract real Twitter data. As a result, some content may contain offensive language.

    1. Getting Set Up

    In order to access the Twitter API, you will need to use an integration to set an environment variable. To add a new integration in your Workspace, click on the Integrations icon in the far left toolbar of the Workspace editing interface. Next, click "Add Integration" and "Environment Variables". You will need to specify the name (BEARER_TOKEN) and the value (the token you were provided). You can call this "Twitter Integration". You can read more about integrations here. Click "Create" and follow the remaining steps, and you should be ready to go!
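
    Before running the rest of the template, it can help to confirm that the environment variable is actually visible to your code. The snippet below is a minimal sketch of such a check. `read_bearer_token` is a hypothetical helper that is not part of the template or of tweepy; it only assumes the variable is named BEARER_TOKEN, as described above.

```python
import os


def read_bearer_token(env=None, name="BEARER_TOKEN"):
    """Return the token stored under `name`, or raise a descriptive error.

    `read_bearer_token` is a hypothetical helper written for this example.
    """
    # Fall back to the real process environment when no mapping is passed in
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set. Check the environment variable integration."
        )
    return token


# Example with an explicit mapping, so no real token is needed to try it out
print(read_bearer_token({"BEARER_TOKEN": "example-token"}))  # example-token
```

    Calling `read_bearer_token()` with no arguments reads from `os.environ`, so a missing integration shows up as a readable message rather than a bare KeyError.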

    The code then performs the following:

    1. Installs and imports the packages you will use to retrieve Twitter data and visualize it.
    2. Sets your bearer token for accessing the Twitter API. This does not require further input if you have configured your BEARER_TOKEN environment variable correctly.
    3. Sets the hashtag you want to analyze. By default, this template builds a retweet network for the hashtag "#python". You are free to supply any hashtag you wish to use (a topic preceded by a # symbol).
    4. Initializes a tweepy Client and retrieves the ten most recent tweets for your hashtag as a test.
    %%capture
    # Install necessary packages
    !pip install tweepy
    # Import packages
    import os
    import tweepy
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    import networkx as nx
    import plotly.graph_objects as go
    
    # Set bearer_token for essential access
    bearer_token = os.environ["BEARER_TOKEN"]
    
    # Set the hashtag you wish to explore
    hashtag = "#python"
    
    # Initialize the Tweepy client
    client = tweepy.Client(bearer_token=bearer_token)
    
    # Confirm the client is initialized by printing the 10 most recent tweets using your hashtag
    for tweet in client.search_recent_tweets(hashtag).data:
        print(tweet.text)

    The code above should return the text of the 10 most recent tweets using the hashtag you supplied. If you have not set up your integration correctly (or are using the wrong bearer token), you may encounter an error such as:

    Unauthorized: 401 Unauthorized

    If you do encounter such an error, make sure to review the instructions and try again.

    2. Create a DataFrame of Retweets

    Next, you can use the client to retrieve a specified number of tweets related to a topic. The code below defines a custom function that uses Paginator() to return batches of recent tweets (within the past seven days) about a specific topic. There are three parameters you can customize:

    • The hashtag you want to query.
    • The num_results you want to return in total. Tweets are fetched in batches of 100, so the number must be a multiple of 100 and cannot exceed 2000.
    • The language (lang) of the tweets you query. The language is set to English by default, but you can use other languages if you prefer!

    Note: Depending on the number of results you return, this code can take some time to execute.
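
    The constraints above boil down to a little arithmetic: tweets are requested in pages of 100, so num_results=1000 means 10 requests. The sketch below shows that arithmetic together with the search query string used in this section. `build_query` and `count_batches` are hypothetical names used only for illustration; the page size of 100 and the cap of 2000 are the values stated above.

```python
def build_query(hashtag, lang="en"):
    """Build a search query that matches only retweets in one language."""
    return f"{hashtag} is:retweet lang:{lang}"


def count_batches(num_results, page_size=100, cap=2000):
    """Return how many pages of `page_size` tweets cover `num_results`."""
    if num_results > cap:
        raise ValueError(f"`num_results` must be less than or equal to {cap}.")
    if num_results % page_size != 0:
        raise ValueError(f"`num_results` must be a multiple of {page_size}.")
    # Integer division keeps the page count a whole number
    return num_results // page_size


print(build_query("#python"))  # #python is:retweet lang:en
print(count_batches(1000))     # 10
```

    Swapping `lang="en"` for another language code changes only the last term of the query.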

    # Define a function to query tweets
    def get_retweets(hashtag, num_results=1000, lang="en"):
        # Initialize an empty DataFrame to store user data and tweets
        tweets_df = pd.DataFrame()
        
        # Validate num_results, then work out how many batches of 100 tweets to request
        if num_results > 2000:
            raise ValueError("`num_results` must be less than or equal to 2000.")
        elif num_results % 100 != 0:
            raise ValueError("`num_results` must be a multiple of 100.")
        max_results = 100
        limit = num_results // max_results  # Integer number of batches
    
        # Iterate through batches of tweets
        for tweet_batch in tweepy.Paginator(
            client.search_recent_tweets,
            query=hashtag + " is:retweet lang:" + lang,
            max_results=100,
            limit=limit,
            expansions=["author_id"],
            user_fields=["username", "id"],
        ):
            # Skip a batch that came back without any tweets
            if not tweet_batch.data:
                continue
            # Retrieve data and user data from batch and add it to DataFrame
            batch_data = pd.DataFrame(tweet_batch.data)
            users = {u["id"]: u["username"] for u in tweet_batch.includes["users"]}
            batch_data["retweeter"] = batch_data["author_id"].map(users)
            # Concatenate temporary DataFrames to existing DataFrames
            tweets_df = pd.concat([tweets_df, batch_data], ignore_index=True)
    
        # Extract the original tweeter (the first @mention) from the tweet text
        tweets_df["tweeter"] = tweets_df["text"].str.extract(r"@(\w+)")
        # Return DataFrame
        return tweets_df
    
    
    # Create a DataFrame using the function defined above
    df = get_retweets(
        hashtag,  # The hashtag you supplied at the beginning
        num_results=1000,  # The maximum results to return
    )
    
    # Preview the DataFrame
    df
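
    To see how the "tweeter" column comes about, the sketch below applies the same pattern, `@(\w+)`, to a few made-up retweet texts using only the standard library. The usernames and texts are invented for illustration, and `original_tweeter` is a hypothetical helper. It relies on the assumption that a retweet's text begins with "RT @username:", so the first mention is the original author.

```python
import re

# Same pattern as the template: capture the word characters after the first "@"
MENTION = re.compile(r"@(\w+)")


def original_tweeter(text):
    """Return the first @mentioned username in `text`, or None if there is none."""
    match = MENTION.search(text)
    return match.group(1) if match else None


# Made-up rows standing in for the "retweeter" and "text" columns
sample_rows = [
    ("alice", "RT @bob: Five #python tips for beginners"),
    ("carol", "RT @bob: Five #python tips for beginners"),
    ("dave", "RT @erin_codes: New #python release notes, thanks @bob"),
    ("frank", "A tweet with no mention at all"),
]

# Each (retweeter, tweeter) pair is one edge of the retweet network
edges = [(retweeter, original_tweeter(text)) for retweeter, text in sample_rows]
for edge in edges:
    print(edge)
```

    Only the first mention is kept, so the trailing "@bob" in the third row is ignored. In the DataFrame, a row without any mention ends up with a missing value in "tweeter" instead of None.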

    3. Calculating the Importance of Users in the Network

    The next step is to analyze the network. The extracted tweets and relevant information are returned as a DataFrame containing an 'edge list', or a list of all edges between nodes. The code then calculates three measures of centrality for each user in the network.

    • In-degree centrality is based on the number of edges going into a node. In a retweet network, a high in-degree centrality indicates that a user is being retweeted by many other users.
    • Out-degree centrality is based on the number of edges going out of a node. In a retweet network, a high out-degree centrality indicates that a user retweets many other users.
    • Betweenness centrality is based on the number of 'shortest paths' between pairs of nodes that pass through a specific node. In a retweet network, it measures the extent to which a user connects otherwise separate communities of users.

    The code also stores each user's in-degree (how many users retweeted them) and out-degree (how many users they retweeted) to visualize later with Plotly.
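
    As a rough illustration of what the two degree centralities mean, the sketch below computes them by hand for a tiny made-up edge list. It assumes the normalization networkx applies, dividing each degree by n - 1 (the number of other nodes), and it is not a replacement for the networkx functions. `degree_centralities` is a hypothetical name.

```python
from collections import Counter


def degree_centralities(edges):
    """Return (in_centrality, out_centrality) dicts for a directed edge list."""
    # A simple directed graph keeps at most one edge per (source, target) pair
    unique_edges = set(edges)
    nodes = {node for edge in unique_edges for node in edge}
    in_degree = Counter(target for _, target in unique_edges)
    out_degree = Counter(source for source, _ in unique_edges)
    # Normalize by the largest possible degree, n - 1 (guarding against n == 1)
    denominator = max(len(nodes) - 1, 1)
    in_centrality = {node: in_degree[node] / denominator for node in nodes}
    out_centrality = {node: out_degree[node] / denominator for node in nodes}
    return in_centrality, out_centrality


# Made-up (retweeter, tweeter) pairs; alice retweeted bob twice
edges = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("alice", "erin"), ("alice", "bob"),
]
in_c, out_c = degree_centralities(edges)
print(in_c["bob"])     # 0.75
print(out_c["alice"])  # 0.5
```

    Note that the repeated alice-to-bob retweet counts once: in a simple directed graph the degree reflects how many distinct users are connected, not how many individual retweets occurred.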

    # Create a directed network graph from the DataFrame
    G = nx.from_pandas_edgelist(df, "retweeter", "tweeter", create_using=nx.DiGraph())
    
    # Calculate the in-degree centrality for retweets
    in_centrality = nx.in_degree_centrality(G)
    
    # Store each user's in-degree (how many distinct users retweeted them) for later plotting
    in_degrees = dict(G.in_degree())
    
    # Store centralities in DataFrame
    popular_tweeters = pd.DataFrame(
        list(in_centrality.items()), columns=["username", "in-degree_centrality"]
    )
    
    # Print the most important users by their centrality
    popular_tweeters.sort_values("in-degree_centrality", ascending=False).head()
    # Calculate the out-degree centrality for retweets
    out_centrality = nx.out_degree_centrality(G)
    
    # Store each user's out-degree (how many distinct users they retweeted) for later plotting
    out_degrees = dict(G.out_degree())
    
    # Store centralities in DataFrame
    active_retweeters = pd.DataFrame(
        list(out_centrality.items()), columns=["username", "out-degree_centrality"]
    )
    
    # Print the most important users by the amount they retweet
    active_retweeters.sort_values("out-degree_centrality", ascending=False).head()
    # Create an undirected network graph from the DataFrame
    G = nx.from_pandas_edgelist(df, "retweeter", "tweeter")
    
    # Calculate the betweenness centrality for retweets
    betweenness = nx.betweenness_centrality(G)
    
    # Store centralities in DataFrame
    bridging_users = pd.DataFrame(
        list(betweenness.items()), columns=["username", "betweenness"]
    )
    
    # Print the most important users by how much they bridge other users
    bridging_users.sort_values("betweenness", ascending=False).head()

    Try comparing this list of users with high betweenness against the list of users with a high in-degree centrality. Are the users with the most retweets also connecting different communities together?
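
    To make the idea of a 'bridging' user concrete, the sketch below computes betweenness on a tiny made-up undirected network: two triangles of users joined by a single user called "bridge". It uses a breadth-first search from every node and is written for clarity, not speed. The scaling by 2 / ((n - 1)(n - 2)) is an assumption about how networkx normalizes undirected graphs, and `betweenness_scores` is a hypothetical name.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adjacency, source):
    """Return (distances, shortest-path counts) from `source` to reachable nodes."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:  # First time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness_scores(edges):
    """Normalized betweenness for an undirected edge list (brute-force sketch)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    info = {node: bfs_counts(adjacency, node) for node in adjacency}
    scores = {node: 0.0 for node in adjacency}
    for s, t in combinations(adjacency, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in adjacency:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v is on a shortest s-t path when the two legs add up exactly
            if dist_s[v] + dist_t[v] == dist_s[t]:
                scores[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    n = len(adjacency)
    scale = 2 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: score * scale for node, score in scores.items()}


# Two made-up communities (triangles) joined only through "bridge"
edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a1", "bridge"), ("bridge", "b1"),
]
scores = betweenness_scores(edges)
for user, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(user, round(score, 3))
```

    In this toy network "bridge" has only two connections, fewer than "a1" or "b1", yet it scores highest because every path between the two triangles runs through it. That is the pattern to look for when comparing the betweenness and in-degree rankings.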

    4. Visualizing the Retweet Network

    The code below creates an interactive network visualization using Plotly. There are a number of different attributes to the plot worth noting:

    • The nodes are colored by the number of retweeted tweets. Users with many retweeted tweets are colored blue and purple, and users who only retweet are colored yellow.
    • Nodes with more connections in total are drawn larger to make them easier to spot.
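
    The two encodings above rest on two small transformations: a min-max rescaling of connection counts into a range of marker sizes, and a log transform of the retweeted-tweet counts used for color. The sketch below reproduces both with the standard library on made-up numbers. The size range of 5 to 30 and the floor of 1.0e-10 mirror the values used in this template, while `scale_sizes` and `log_colors` are hypothetical names.

```python
import math


def scale_sizes(values, low=5.0, high=30.0):
    """Linearly rescale `values` so the smallest maps to `low`, the largest to `high`."""
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        # All nodes have the same count, so give them all the smallest size
        return [low for _ in values]
    return [low + (value - smallest) / span * (high - low) for value in values]


def log_colors(values, floor=1.0e-10):
    """Return log10 of each value, replacing zeros with a tiny floor first."""
    return [math.log10(max(value, floor)) for value in values]


connections = [1, 2, 6, 11]  # Made-up connection counts per node
retweeted = [0, 1, 10, 100]  # Made-up retweeted-tweet counts per node

sizes = scale_sizes(connections)
colors = log_colors(retweeted)
print(sizes)  # [5.0, 7.5, 17.5, 30.0]
print(colors)
```

    The log transform keeps a handful of heavily retweeted users from washing out the color differences among everyone else. Users with a count of zero land far below the rest because of the floor, which is why they stand out at one end of the color scale.
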
    # Create the graph and specify the layout of the graph
    G = nx.from_pandas_edgelist(df, "retweeter", "tweeter")
    pos = nx.drawing.layout.spring_layout(G)
    nx.set_node_attributes(G, pos, "pos")
    
    # Create a dictionary mapping each node's position in G.nodes() to the node itself
    nodes_dict = {index: node for (index, node) in enumerate(G.nodes())}
    
    # Gather edge positions for visualization
    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0_point, y0_point = G.nodes[edge[0]]["pos"]
        x1_point, y1_point = G.nodes[edge[1]]["pos"]
        edge_x.append(x0_point)
        edge_x.append(x1_point)
        edge_x.append(None)
        edge_y.append(y0_point)
        edge_y.append(y1_point)
        edge_y.append(None)
    
    # Add edges as disconnected lines
    edge_trace = go.Scatter(
        x=edge_x,
        y=edge_y,
        line=dict(width=0.5, color="#000000"),
        hoverinfo="none",
        mode="lines",
    )
    
    # Gather node positions for visualization
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = G.nodes[node]["pos"]
        node_x.append(x)
        node_y.append(y)
    
    # Initialize lists to hold each node's hover text, connection count, size, and color
    node_text = []
    node_adjacencies = []
    node_sizes = []
    node_colors = []
    
    # Iterate through the nodes and create the text to be shown on hover
    for node_number, adjacencies in enumerate(G.adjacency()):
        node_text.append(  # Set the text to be shown on hover
            "Username: "
            + str(adjacencies[0])  # The username
            + "<br>"
            + "Number of Connections: "
            + str(len(adjacencies[1]))  # The number of connections
            + "<br>"
            + "Number of Retweeted Tweets: "
            + str(in_degrees[nodes_dict[node_number]])
            + "<br>"
            + "Number of Retweets: "
            + str(out_degrees[nodes_dict[node_number]])
        )
        node_adjacencies.append(len(adjacencies[1]))
        node_sizes.append(len(adjacencies[1]))
        node_colors.append(in_degrees[nodes_dict[node_number]])
    
    # Log transform the color list for visualization purposes
    node_colors = np.array(node_colors)
    node_colors_temp = np.where(node_colors > 1.0e-10, node_colors, 1.0e-10)
    node_colors_log = np.log10(node_colors_temp)
    
    # Scale the size of the nodes between two values, then flatten to one size per node
    scaler = MinMaxScaler(feature_range=(5, 30))
    node_sizes = scaler.fit_transform(np.array(node_sizes).reshape(-1, 1)).ravel()
    
    # Plot the nodes
    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers",
        hoverinfo="text",
        marker=dict(
            showscale=True,
            colorscale="Plasma",  # For more colorscale options, go here: https://plotly.com/python/builtin-colorscales/
            reversescale=True,
            color=[],
            size=10,
            colorbar=dict(
                thickness=15,
                title=dict(
                    text="Number of Retweeted Tweets (Log Transformed)",
                    font=dict(size=12),
                    side="right",
                ),
                xanchor="left",
                tickvals=[min(node_colors_log), max(node_colors_log)],
                ticktext=["Low", "High"],
            ),
            line=dict(color="Black", width=0.5),
        ),
    )
    
    # Set the size, color, and text of the nodes
    node_trace.marker.size = node_sizes
    node_trace.marker.color = node_colors_log
    node_trace.text = node_text
    
    # Customize and display the figure
    fig = go.Figure(
        data=[edge_trace, node_trace],
        layout=go.Layout(
            title="<b>Retweet Network Graph for "
            + str(hashtag)
            + "</b>",  # Set your title here
            title_x=0.5,
            title_font_size=16,
            showlegend=False,
            paper_bgcolor="#e6e6e6",  # Set the background color (excluding the plot) here
            plot_bgcolor="#e6e6e6",  # Set the plot background color here
            font_color="#000000",
            hovermode="closest",
            margin=dict(b=20, l=5, r=5, t=40),
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            width=800,  # Adjust the width of the plot
            height=500,  # Adjust the height of the plot
        ),
    )
    fig.show()

    Be sure to explore the plot and examine the network. In particular, try the following:

    1. Hover over individual nodes to see their username, the number of connections they have, the number of retweeted tweets, and the number of retweets. Are there any names you already know?
    2. Use the cursor to draw boxes and zoom in on particular regions of the plot. You can always double-click to zoom back out!
    3. Inspect how the larger nodes with more connections interact with other larger nodes.

    Note: Although there are many lines of code customizing the interactive visualization, you can run this code as-is without further modification. However, if you are interested, try adjusting things like colors, text, etc.