Georgina Woo

Instructions


This document contains a series of programming problems designed to help you practice plotting graphs with Python. Tasks 6 and 7 use data points provided in .csv files. The files are hyperlinked in the problem descriptions and can also be found here: Problem Set Data

Matplotlib, Seaborn, and numpy.random Functions

| Function | Purpose | Example Syntax |
| --- | --- | --- |
| plt.plot(x, y) | Plot a line connecting x and y values. | plt.plot(years, population) |
| plt.title() | Add a title to the plot. | plt.title("Population Over Years") |
| plt.xlabel() | Label the x-axis. | plt.xlabel("Years") |
| plt.ylabel() | Label the y-axis. | plt.ylabel("Population (Billions)") |
| plt.grid() | Add gridlines to the plot. | plt.grid(True) |
| plt.show() | Display the plot. | plt.show() |
| sns.barplot() | Create a bar graph. | sns.barplot(x=categories, y=values) |
| sns.histplot() | Plot a histogram. | sns.histplot(data, bins=10, kde=True) |
| sns.heatmap() | Create a heatmap. | sns.heatmap(data, annot=True) |
| np.random.randint() | Generate random integers. | np.random.randint(18, 60, 100) |
| np.random.rand() | Generate random float values. | np.random.rand(6, 6) |


Task 1: Create a Line Plot


Objective: Visualize trends in a dataset using a line plot.

Instructions:

  1. Create a dataset with two lists: years (e.g., [2000, 2005, 2010, 2015, 2020]) and population (e.g., [6.1, 6.5, 6.9, 7.3, 7.8] in billions).
  2. Use Matplotlib to create a line plot showing how the population changes over the years.
  3. Add a title, labels for the x-axis (Years) and y-axis (Population), and grid lines.

Hint: Use plt.plot() to draw the line and customize the style with arguments like color, linestyle, and marker.
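The steps above could be sketched as follows. This is one possible approach, not the only valid solution; the color, line style, and marker values are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Sample data from the task description (population in billions)
years = [2000, 2005, 2010, 2015, 2020]
population = [6.1, 6.5, 6.9, 7.3, 7.8]

plt.figure()

# color, linestyle, and marker control how the line looks
line, = plt.plot(years, population, color="tab:blue", linestyle="--", marker="o")

plt.title("Population Over Years")
plt.xlabel("Years")
plt.ylabel("Population (Billions)")
plt.grid(True)

# Call plt.show() here to display the figure.
```

Try swapping in other values (for example linestyle="-" or marker="s") to see how the plot changes.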


Task 2: Customize a Bar Graph


Objective: Use a bar graph to compare data categories.

Instructions:

  1. Create a dataset with two lists: categories (e.g., ["A", "B", "C", "D"]) and values (e.g., [15, 25, 35, 20]).
  2. Use Seaborn's barplot() function to create a bar graph comparing the values for each category.
  3. Customize the graph:

Hint: Use sns.barplot() for the bar graph and plt.text() to annotate the bars.
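A minimal sketch of the hint, assuming the customization you want is a value label above each bar. The title, axis labels, and the 0.5 vertical offset are illustrative choices, not requirements of the task.

```python
import matplotlib.pyplot as plt
import seaborn as sns

categories = ["A", "B", "C", "D"]
values = [15, 25, 35, 20]

ax = sns.barplot(x=categories, y=values)

# Categorical bars sit at x = 0, 1, 2, ...; write each value just above its bar
for i, v in enumerate(values):
    plt.text(i, v + 0.5, str(v), ha="center")

plt.title("Values by Category")  # illustrative title
plt.xlabel("Category")
plt.ylabel("Value")

# Call plt.show() here to display the figure.
```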


Task 3: Plot a Histogram


Objective: Visualize the distribution of a dataset using a histogram.

Instructions:

  1. Create a dataset of 100 random numbers representing ages between 18 and 60. You can use np.random.randint(18, 61, 100) (the upper bound is exclusive, so 61 is needed to include 60), or set your own probabilities and age ranges and use random.random() to generate a number between 0 and 1.
  2. Plot a histogram using Seaborn to visualize the age distribution.
  3. Customize the histogram:

Hint: Use sns.histplot() for the histogram and enable the kde argument.
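If you choose the second option in step 1, one way to do it is sketched below. The age ranges and probabilities are hypothetical values chosen for illustration, and sample_age is a helper name invented for this sketch.

```python
import random

# Hypothetical age ranges and their probabilities (should sum to 1)
age_ranges = [(18, 25), (26, 40), (41, 60)]
probabilities = [0.3, 0.5, 0.2]


def sample_age():
    """Pick an age range according to its probability, then a uniform age within it."""
    r = random.random()  # float in [0, 1)
    cumulative = 0.0
    for (low, high), p in zip(age_ranges, probabilities):
        cumulative += p
        if r < cumulative:
            return random.randint(low, high)  # random.randint includes both ends
    # Fallback in case floating-point rounding leaves r just above the final sum
    low, high = age_ranges[-1]
    return random.randint(low, high)


random.seed(0)  # fixed seed so the dataset is reproducible
ages = [sample_age() for _ in range(100)]
print(ages[:10])
```

The resulting list can be passed straight to sns.histplot(ages, bins=10, kde=True). Note that random.randint from the standard library includes its upper bound, unlike np.random.randint.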


Task 4: Create a Heatmap


Objective: Visualize correlations between data points using a heatmap.

Instructions:

  1. Create a dataset of random numbers using np.random.rand(6, 6) to represent a correlation matrix.
  2. Use Seaborn's heatmap() function to plot the heatmap.
  3. Customize the heatmap:

Hint: Use sns.heatmap() and enable the annot=True argument to show the values inside the cells.
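A possible sketch of the steps above. The fmt, cmap, vmin, and vmax arguments and the title are illustrative customizations. Keep in mind that np.random.rand only stands in for a correlation matrix here: a real one is symmetric, has 1s on the diagonal, and ranges from -1 to 1.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

np.random.seed(0)  # fixed seed so the figure is reproducible
data = np.random.rand(6, 6)  # 6x6 array of floats in [0, 1)

# annot=True writes each value in its cell; fmt controls the number format
ax = sns.heatmap(data, annot=True, fmt=".2f", cmap="coolwarm", vmin=0, vmax=1)

plt.title("Heatmap of Random Values")  # illustrative title

# Call plt.show() here to display the figure.
```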


Task 5: Create a Grouped Bar Chart for Model Accuracies


Objective: Visualize training and testing accuracies for different models using a grouped bar chart.

Instructions:

  1. Set Up the Data:

Example:

models = ["Model A", "Model B", "Model C", "Model D"]
training_accuracies = [88, 92, 85, 90]
testing_accuracies = [82, 87, 80, 86]

  2. Prepare for Plotting:
  3. Choose a color palette

Example

# Pastel color palette of size 2
custom_palette = sns.color_palette("pastel", 2)

  4. Create the Bar Chart:

Example

# Plot the bars (x and width are assumed to be defined already: the group positions and the bar width)
plt.bar(x - width / 2, training_accuracies, width, label="Training Accuracy", color=custom_palette[0])
plt.bar(x + width / 2, testing_accuracies, width, label="Testing Accuracy", color=custom_palette[1])

  5. Customize the Chart:
  6. Display the Chart:
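Putting the pieces together, a minimal sketch might look like the following. The definitions of x and width are assumptions (the examples above use them without defining them), and plain Matplotlib color names stand in for the Seaborn palette so the sketch depends on Matplotlib and NumPy only.

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["Model A", "Model B", "Model C", "Model D"]
training_accuracies = [88, 92, 85, 90]
testing_accuracies = [82, 87, 80, 86]

x = np.arange(len(models))  # assumed: one group position per model
width = 0.35                # assumed bar width; two bars fit side by side

# Stand-in colors; sns.color_palette("pastel", 2) can be used the same way
custom_palette = ["lightskyblue", "lightsalmon"]

plt.figure(figsize=(8, 5))

# Shift each series half a bar width left or right of the group position
train_bars = plt.bar(x - width / 2, training_accuracies, width,
                     label="Training Accuracy", color=custom_palette[0])
test_bars = plt.bar(x + width / 2, testing_accuracies, width,
                    label="Testing Accuracy", color=custom_palette[1])

plt.xticks(x, models)  # label each group with its model name
plt.ylabel("Accuracy (%)")
plt.title("Training vs. Testing Accuracy by Model")
plt.legend()

# Call plt.show() here to display the figure.
```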


Task 6: Life Expectancy Visualization and Comparison


Data Source: UN WPP (2024); HMD (2024); Zijdeman et al. (2015); Riley (2005) – OurWorldinData.org/life-expectancy

Your task is to analyze life expectancy data from a CSV file, focusing on specific countries, and compare their trends over time.

Step 1: Load and Prepare the Dataset

Load the life expectancy dataset and make it easier to work with.

Instructions:

  1. Load the life-expectancy.csv file into a pandas DataFrame.
  2. Rename the last column to "Life Expectancy" and the "Entity" column to "Country".
  3. Convert the "Country" and "Code" columns to lowercase for easier matching.
  4. Display the first 5 rows of the DataFrame to ensure it's loaded correctly.

Hint: To rename columns, use the rename() method. To lowercase the column names, use .str.lower() on df.columns; to lowercase the values in a column, use .str.lower() on that column.

Example:

# Rename a column
df.rename(columns={'Old Column Name': 'New Column Name'}, inplace=True)

# Convert column names to lowercase
df.columns = df.columns.str.lower()

# Convert data in "country" and "code" columns to lowercase for easier matching
df['country'] = df['country'].str.lower()
df['code'] = df['code'].str.lower()
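The whole of Step 1 could be sketched as below. To keep the sketch runnable without the real file, it reads a tiny in-memory CSV: the entities, codes, values, and the name of the last column are all placeholders, not real data. Renaming via df.columns[-1] avoids having to type the last column's actual name.

```python
import io

import pandas as pd

# Placeholder stand-in for life-expectancy.csv (made-up entities and values)
csv_text = """Entity,Code,Year,Placeholder life expectancy column
Alpha,ALP,2000,60.0
Alpha,ALP,2001,61.0
Beta,BET,2000,70.0
"""

# With the real file this would be: pd.read_csv("life-expectancy.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Rename "Entity" and whatever the last column happens to be called
df = df.rename(columns={"Entity": "Country", df.columns[-1]: "Life Expectancy"})

# Lowercase the column names, then the values used for matching
df.columns = df.columns.str.lower()
df["country"] = df["country"].str.lower()
df["code"] = df["code"].str.lower()

print(df.head())
```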

Step 2: Filter by Country or Code

Ask the user to select one or more countries (using their names or codes) for analysis.

Instructions:

  1. Prompt the user to input the country name(s) or code(s), separated by commas.
  2. Match the user input to the "country" or "code" columns in the DataFrame.
  3. Print the matched countries to confirm the selection.

Hint: Use pandas filtering to check if the input matches values in the "country" or "code" columns.

Example:

# Check if a value exists in a column ("usa" is a code, so look in the "code" column)
if "usa" in df['code'].values:
    print("USA found!")
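One way to handle several comma-separated entries at once is sketched below. The DataFrame holds placeholder data, and select_countries is a helper name invented for this sketch. It takes the input as a string argument so it can be tried without typing at a prompt.

```python
import pandas as pd

# Placeholder data shaped like the prepared DataFrame from Step 1
df = pd.DataFrame({
    "country": ["alpha", "alpha", "beta", "gamma"],
    "code": ["alp", "alp", "bet", "gam"],
    "year": [2000, 2001, 2000, 2001],
    "life expectancy": [60.0, 61.0, 70.0, 65.0],
})


def select_countries(df, user_input):
    """Return the rows whose country name or code matches any comma-separated entry."""
    # Split on commas, trim whitespace, lowercase, and drop empty entries
    terms = [t.strip().lower() for t in user_input.split(",") if t.strip()]
    mask = df["country"].isin(terms) | df["code"].isin(terms)
    return df[mask]


# In the assignment this string would come from input(...)
selected = select_countries(df, "Alpha, BET")
print("Matched:", sorted(selected["country"].unique()))
```

Because each entry is checked against both columns, the user can mix names and codes in the same input.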

Step 3: Identify the Minimum Common Year

Find the earliest year for which all the selected countries have data.

Instructions:

  1. Iterate through the years for each country.
  2. Find the intersection of years across the selected countries.
  3. Print the earliest common year.

Hint: Use Python sets to find the common years, then the min() function to find the minimum year.

Example:

# Find the intersection of two sets
years1 = {2000, 2001, 2002}
years2 = {2001, 2002, 2003}
common_years = years1 & years2
print("Common Years:", common_years)
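Extending the two-set example to any number of selected countries might look like this. The DataFrame is placeholder data, assumed to be already filtered to the selected countries.

```python
import pandas as pd

# Placeholder data: "alpha" starts in 1990, "beta" starts in 2000
df = pd.DataFrame({
    "country": ["alpha", "alpha", "alpha", "beta", "beta", "beta"],
    "year": [1990, 2000, 2001, 2000, 2001, 2002],
    "life expectancy": [55.0, 60.0, 61.0, 70.0, 71.0, 72.0],
})

# Build one set of years per country
year_sets = [set(group["year"]) for _, group in df.groupby("country")]

# Intersect all of the sets at once, then take the earliest shared year
common_years = set.intersection(*year_sets)
min_common_year = min(common_years)

print("Earliest common year:", min_common_year)
```

If the countries share no years, common_years is empty and min() raises a ValueError, so it is worth checking for that case before calling it.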

Step 4: Filter Data by Year

Filter the DataFrame to only include rows from the minimum common year onward.

Instructions:

  1. Use pandas filtering to subset the DataFrame.
  2. Preview the filtered data.

Hint: To filter data by a condition, use pandas boolean indexing.

Example:

# Filter rows where column values are greater than or equal to a threshold
filtered_df = df[df['year'] >= 1950]

Step 5: Plot the Data

Plot the life expectancy data for the selected countries, either alone or compared to the world average.

Instructions:

  1. Use matplotlib to create the plots:
  2. Add titles, axis labels, and a legend.

Hint: Use the plot() function from matplotlib to plot line graphs.

Example:

# Plot a line graph
plt.plot(df['year'], df['life expectancy'], label="Country Name")
plt.legend()
plt.show()
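To draw one line per selected country, a loop over groupby is one option. The sketch below uses placeholder data, and the title and axis labels are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data shaped like the filtered DataFrame from Step 4
df = pd.DataFrame({
    "country": ["alpha", "alpha", "alpha", "beta", "beta", "beta"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "life expectancy": [60.0, 61.0, 62.0, 70.0, 70.5, 71.0],
})

plt.figure(figsize=(10, 6))

# One line per country; title() restores capitalization for the legend
for country, group in df.groupby("country"):
    plt.plot(group["year"], group["life expectancy"], marker="o", label=country.title())

plt.title("Life Expectancy Over Time")
plt.xlabel("Year")
plt.ylabel("Life Expectancy (years)")
plt.legend()

# Call plt.show() here to display the figure.
```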

Step 6: Customize the Plot (Optional)

Experiment with customizing the plot by changing colors, adding gridlines, or modifying line styles.

Task:

  1. Change the color of the lines for the countries.
  2. Add a grid to the plot.
  3. Save the plot as a PNG file using plt.savefig("filename.png").

Hint:

Example:

# Add gridlines and save the plot
plt.grid(True)
plt.savefig("life_expectancy_plot.png")


Task 7: Fitting Curves to Mystery Datasets


Objective:

You are provided with three datasets (mystery1.csv, mystery2.csv, mystery3.csv) in the Drive folder. Each .csv contains 2 columns of x and y values. Your task is to:

  1. Load the data and visualize it using an appropriate plot.
  2. Try fitting three types of curves (linear, exponential, sinusoidal) to the data.
  3. Compare the fits using mean squared error (MSE) to decide which curve fits the data best.

Instructions:

  1. Load and Visualize the Data
     Hint: A scatter plot is a great way to see relationships between x and y values.
  2. Define Functions for Curve Fitting
  3. Fit the Curves

# Initial parameter guesses for the sinusoidal fit
amplitude_guess = (y.max() - y.min()) / 2
frequency_guess = 2 * np.pi / (x.max() - x.min())
phase_guess = 0

  4. Compare Fits
  5. Decide the Best Fit
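For step 2, the model functions might be defined as sketched below. These parameterizations are assumptions, since the task does not prescribe exact forms. The sinusoid takes three parameters in the order amplitude, frequency, phase so that it lines up with the three initial guesses above, and frequency is an angular frequency, matching the 2π-based guess. The mse helper is a name invented for this sketch.

```python
import numpy as np


def linear_func(x, a, b):
    """Straight line: slope a, intercept b."""
    return a * x + b


def exponential_func(x, a, b):
    """Exponential growth or decay: scale a, rate b (assumed form, no offset term)."""
    return a * np.exp(b * x)


def sinusoidal_func(x, amplitude, frequency, phase):
    """Sine wave with angular frequency (assumed form, no vertical offset)."""
    return amplitude * np.sin(frequency * x + phase)


def mse(y, y_fit):
    """Mean squared error between observed and fitted values."""
    return np.mean((np.asarray(y) - np.asarray(y_fit)) ** 2)


# Quick sanity check: a model evaluated against its own output has zero error
x_demo = np.linspace(0, 10, 50)
y_demo = linear_func(x_demo, 2.0, 1.0)
print(mse(y_demo, linear_func(x_demo, 2.0, 1.0)))
```

Each function takes x first and the fit parameters after it, which is the signature scipy.optimize.curve_fit expects.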

If you’re stuck:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Assumes x and y are NumPy arrays loaded from the .csv, and that linear_func,
# exponential_func, and sinusoidal_func were defined in step 2

# List to store our results for each curve fit attempt
results = []

# Linear fit
popt_linear, _ = curve_fit(linear_func, x, y)
y_linear_fit = linear_func(x, *popt_linear)
mse_linear = np.mean((y - y_linear_fit) ** 2)
results.append(("Linear", y_linear_fit, mse_linear, "red"))

# Exponential fit
popt_exponential, _ = curve_fit(exponential_func, x, y, maxfev=10000)
y_exponential_fit = exponential_func(x, *popt_exponential)
mse_exponential = np.mean((y - y_exponential_fit) ** 2)
results.append(("Exponential", y_exponential_fit, mse_exponential, "green"))

# Sinusoidal fit (after estimating initial parameters)
popt_sinusoidal, _ = curve_fit(
    sinusoidal_func, x, y,
    p0=[amplitude_guess, frequency_guess, phase_guess],
    maxfev=10000,
)
y_sinusoidal_fit = sinusoidal_func(x, *popt_sinusoidal)
mse_sinusoidal = np.mean((y - y_sinusoidal_fit) ** 2)
results.append(("Sinusoidal", y_sinusoidal_fit, mse_sinusoidal, "purple"))

# Plot the original data and best-fit curves
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label="Original Data", color="blue")
for name, fit, mse, color in results:
    plt.plot(x, fit, label=f"{name} Fit (MSE: {mse:.2f})", color=color)
# Title, label, and show the graph
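Once results is filled in, the best fit can be picked programmatically rather than by eye. The sketch below uses tuples with the same layout as above (name, fitted values, MSE, color). The MSE numbers are placeholders, not results from the mystery datasets, and the fitted values are left as None because they are not needed for the comparison.

```python
# Placeholder results in the same (name, fitted values, MSE, color) layout
results = [
    ("Linear", None, 4.2, "red"),
    ("Exponential", None, 1.3, "green"),
    ("Sinusoidal", None, 7.9, "purple"),
]

# The lowest MSE wins; index 2 of each tuple holds the MSE
best_name, _, best_mse, _ = min(results, key=lambda r: r[2])

print(f"Best fit: {best_name} (MSE: {best_mse:.2f})")
```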