Practical 2 Solutions

Jumping Rivers

Copy cat

This data set consists of seven observations on cotton aphid counts on twenty randomly chosen leaves in each plot, for twenty-seven treatment-block combinations. The data were recorded in July 2004 in Lamesa, Texas. The treatments consisted of three nitrogen levels (blanket, variable and none), three irrigation levels (low, medium and high) and three blocks, each being a distinct area. Irrigation treatments were randomly assigned within each block as whole plots. Nitrogen treatments were randomly assigned within each whole plot as split plots.
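To make the design concrete, the sketch below enumerates the treatment-block combinations described above and counts the rows we would expect. The level labels and the observation index running from 1 to 7 are assumptions for illustration; the names and values stored in the real data set may differ.

```python
from itertools import product

# Levels taken from the description above; the exact labels are assumed.
nitrogen = ["Blanket", "Variable", "Zero"]
water = ["Low", "Medium", "High"]
blocks = [1, 2, 3]
observations = range(1, 8)  # seven observations per plot (index is hypothetical)

# One plot per nitrogen/water/block combination.
plots = list(product(nitrogen, water, blocks))

# One row per plot per observation.
rows = [(n, w, b, t) for (n, w, b) in plots for t in observations]

print(len(plots))  # number of treatment-block combinations
print(len(rows))   # expected number of rows in the data
```

Under this description there are 3 x 3 x 3 = 27 plots and 27 x 7 = 189 rows, which is a quick sanity check to run against the shape of the loaded data.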

See if you can recreate the plot below. Some hints:

import jrpyvisualisation as jr
import plotly.express as px

aphids = jr.datasets.load_aphids()

aphids['Block'] = aphids['Block'].astype('category')

fig = px.scatter(
  aphids, x='Time', y='Aphids', 
  color='Block', facet_col='Water',
  facet_row='Nitrogen',
  template='plotly_white',
  category_orders={
    'Water' : ['Low', 'Medium', 'High'],
    'Nitrogen': ['Blanket', 'Variable', 'Zero']
  }
)
fig = fig.update_traces(mode='lines+markers')

Movies data

We can start by loading the plotly modules and the movies data.

import jrpyvisualisation as jr
import plotly.express as px

movies = jr.datasets.load_movies()

We will also get rid of all rows that contain missing data. This gives us a much smaller data set to work with, but ensures that every row has all variables recorded.

movies = movies.dropna()
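
Since dropna() removes any row with at least one missing value, it is worth checking how many rows are lost. The sketch below uses a small made-up data frame (the column names and values are hypothetical) rather than the movies data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "length": [90, 120, np.nan, 105],
    "budget": [1e6, np.nan, 5e6, 2e6],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
complete = toy.dropna()

print(missing_per_column)
print(len(toy), "rows before,", len(complete), "rows after")
```

If losing this many rows is a concern, dropna(subset=[...]) restricts the check to the columns needed for a given plot.
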
    • Produce a plot of the budget for films against their length.
    fig = px.scatter(
      movies,
      x='length', y='budget'
    )
    • Try using the trendline argument to add a line of best fit
    fig = px.scatter(
      movies,
      x='length', y='budget',
      trendline='ols'
    )
    • Comment on the line of best fit
    comment = """
    There is a positive relationship between the length of movies and their
    budgets. However, the line predicts negative budgets for the shortest
    films, so we should be careful with interpretation.
    """
    • Add marginal histograms to the x and y axes
    fig = px.scatter(
      movies,
      x='length', y='budget',
      trendline='ols',
      marginal_x='histogram',
      marginal_y='histogram'
    )
    • Produce a scatter plot of the length of a movie against the length of its title. (Hint: df['column_name'].str.len() can be used to count the number of characters in each string of a column)
    movies['title_length'] = movies['title'].str.len()
    fig = px.scatter(
      movies,
      x='title_length', y='length'
    )
    • Give the axes more informative labels
    fig = px.scatter(
      movies,
      x='title_length', y='length',
      labels = {
        'length': 'Length of movie (mins)',
        'title_length' : 'Characters in title of movie.'
      }
    )
    • Add marginal box plots to the axes
    fig = px.scatter(
      movies,
      x='title_length', y='length',
      labels = {
        'length': 'Length of movie (mins)',
        'title_length' : 'Characters in title of movie.'
      },
      marginal_x='box',
      marginal_y='box'
    )
    • Mark each point with a different colour, depending on its mpaa rating. Also add a line of best fit for each rating.
    fig = px.scatter(
      movies,
      x='title_length', y='length',
      labels = {
        'length': 'Length of movie (mins)',
        'title_length' : 'Characters in title of movie.'
      },
      marginal_x='box',
      marginal_y='box',
      color='mpaa',
      trendline='ols'
    )
    • Comment on the lines of best fit added
    comment = """
    Most groups do not show much of a relationship between the length 
    of a movie in minutes and the number of characters in the title.
    
    However, there does appear to be a fairly strong positive relationship
    for films rated NC-17. There are few films with this rating, though, so
    we should be careful about any conclusions we draw.
    """
    • Update your plot so that hovering over a point also shows the title of the film.
    fig = px.scatter(
      movies,
      x='title_length', y='length',
      labels = {
        'length': 'Length of movie (mins)',
        'title_length' : 'Characters in title of movie.'
      },
      marginal_x='box',
      marginal_y='box',
      color='mpaa',
      trendline='ols',
      hover_name='title'
    )