Georgina Woo

Instructions

This document contains a series of programming problems designed to help you practice plotting graphs with Python. Tasks 6 and 7 use data points provided in .csv files. The files are hyperlinked in the problem descriptions, and they can also be found here: Problem Set Data

Matplotlib, Seaborn, and numpy.random Functions

Function	Purpose	Example Syntax
plt.plot(x, y)	Plot a line connecting x and y values.	plt.plot(years, population)
plt.title()	Add a title to the plot.	plt.title("Population Over Years")
plt.xlabel()	Label the x-axis.	plt.xlabel("Years")
plt.ylabel()	Label the y-axis.	plt.ylabel("Population (Billions)")
plt.grid()	Add gridlines to the plot.	plt.grid(True)
plt.show()	Display the plot.	plt.show()
sns.barplot()	Create a bar graph.	sns.barplot(x=categories, y=values)
sns.histplot()	Plot a histogram.	sns.histplot(data, bins=10, kde=True)
sns.heatmap()	Create a heatmap.	sns.heatmap(data, annot=True)
np.random.randint()	Generate random integers.	np.random.randint(18, 60, 100)
np.random.rand()	Generate random float values.	np.random.rand(6, 6)

Task 1: Create a Line Plot

Objective

Visualize trends in a dataset using a line plot.

Instructions:

Create a dataset with two lists: years (e.g., [2000, 2005, 2010, 2015, 2020]) and population (e.g., [6.1, 6.5, 6.9, 7.3, 7.8] in billions).
Use Matplotlib to create a line plot showing how the population changes over the years.
Add a title, labels for the x-axis (Years) and y-axis (Population), and grid lines.

Hint: Use plt.plot() to draw the line and customize the style with arguments like color, linestyle, and marker.
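
As a minimal sketch of the styling arguments mentioned in the hint, the snippet below uses the example lists from the instructions. The particular color, line style, marker, and figure size are arbitrary choices for illustration, not requirements of the task.

```python
import matplotlib.pyplot as plt

years = [2000, 2005, 2010, 2015, 2020]
population = [6.1, 6.5, 6.9, 7.3, 7.8]  # in billions

fig = plt.figure(figsize=(8, 5))  # figure size is an arbitrary choice

# color, linestyle, and marker control how the line is drawn
plt.plot(years, population, color="teal", linestyle="--", marker="o")

plt.title("Population Over Years")
plt.xlabel("Years")
plt.ylabel("Population (Billions)")
plt.grid(True)
plt.show()
```

Try swapping in other values (for example linestyle=":" or marker="s") to see how each argument changes the line.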

Task 2: Customize a Bar Graph

Objective: Use a bar graph to compare data categories.

Instructions:

Create a dataset with two lists: categories (e.g., ["A", "B", "C", "D"]) and values (e.g., [15, 25, 35, 20]).
Use Seaborn's barplot() function to create a bar graph comparing the values for each category.
Customize the graph:

Set the bar color to purple.
Add a title and label the axes.
Add numerical labels on top of each bar showing the exact value.

Hint: Use sns.barplot() for the bar graph and plt.text() to annotate the bars.
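
One possible way to combine sns.barplot() with plt.text() is sketched below. It relies on the fact that Seaborn draws categorical bars at x positions 0, 1, 2, and so on; the 0.5 vertical offset and the title and axis label text are arbitrary choices.

```python
import matplotlib.pyplot as plt
import seaborn as sns

categories = ["A", "B", "C", "D"]
values = [15, 25, 35, 20]

ax = sns.barplot(x=categories, y=values, color="purple")

# Categorical bars sit at x = 0, 1, 2, ..., so each value's index
# is also the x position of its bar.
for i, value in enumerate(values):
    # Place the label slightly above the bar, centered horizontally
    plt.text(i, value + 0.5, str(value), ha="center", va="bottom")

plt.title("Values by Category")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()
```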

Task 3: Plot a Histogram

Objective: Visualize the distribution of a dataset using a histogram.

Instructions:

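Create a dataset of 100 random integers representing ages between 18 and 60. You can use np.random.randint(18, 60, 100); note that the upper bound is exclusive, so this call produces ages from 18 to 59, and you need 61 as the upper bound to include 60. Alternatively, define your own age ranges and probabilities and use random.random(), which returns a float between 0 and 1, to sample from them.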
Plot a histogram using Seaborn to visualize the age distribution.
Customize the histogram:

Use 10 bins.
Set the color to orange.
Add a KDE (Kernel Density Estimate) curve.

Hint: Use sns.histplot() for the histogram and set kde=True to add the density curve.

Task 4: Create a Heatmap

Objective: Visualize correlations between data points using a heatmap.

Instructions:

Create a dataset of random numbers using np.random.rand(6, 6) to represent a correlation matrix.
Use Seaborn's heatmap() function to plot the heatmap.
Customize the heatmap:

Set a color map (e.g., "coolwarm").
Display values inside the cells.
Add a title.

Hint: Use sns.heatmap() and set annot=True to show the values inside the cells.
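
A minimal sketch of these options follows. The fmt=".2f" argument is an addition not mentioned in the task; it only shortens the numbers printed in each cell. Keep in mind that the random values merely stand in for correlations (a real correlation matrix is symmetric with ones on the diagonal).

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

np.random.seed(0)  # optional: makes the random matrix reproducible
data = np.random.rand(6, 6)  # 6x6 matrix of floats in [0, 1)

# annot=True writes each value into its cell; cmap picks the color map
ax = sns.heatmap(data, annot=True, fmt=".2f", cmap="coolwarm")

plt.title("Random 6x6 Heatmap")
plt.show()
```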

Task 5: Create a Grouped Bar Chart for Model Accuracies

Objective:

Visualize training and testing accuracies for different models using a grouped bar chart.

Instructions:

Set Up the Data:

Create a list of model names (e.g., "Model A", "Model B", etc.).
Define training accuracies for each model as a list of percentages.
Define testing accuracies for each model as a list of percentages.

Example:

models = ["Model A", "Model B", "Model C", "Model D"]
training_accuracies = [88, 92, 85, 90]
testing_accuracies = [82, 87, 80, 86]

Prepare for Plotting:

Calculate the positions for each bar using np.arange().
Hint: Use the length of one of the lists above.
Set the width for the bars (e.g., 0.35).

Choose a color palette

Example

# Pastel color palette of size 2

custom_palette = sns.color_palette("pastel", 2)

Create the Bar Chart:

Use Matplotlib's plt.bar() to create grouped bars:

Plot training accuracies to the left of the center for each model.
Plot testing accuracies to the right of the center for each model.

Example

# Plot the bars
plt.bar(x - width / 2, training_accuracies, width, label="Training Accuracy", color=custom_palette[0])
plt.bar(x + width / 2, testing_accuracies, width, label="Testing Accuracy", color=custom_palette[1])

Customize the Chart:

Add a title, axis labels, and a legend.
Use plt.xticks() to label the x-axis with model names.

Display the Chart:

Use plt.show() to render the chart.
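
Putting the pieces of this task together might look like the sketch below. To keep it dependent on Matplotlib and NumPy only, it substitutes two named colors ("skyblue" and "salmon") for the Seaborn pastel palette shown earlier; the title and axis label text are also just example choices.

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["Model A", "Model B", "Model C", "Model D"]
training_accuracies = [88, 92, 85, 90]
testing_accuracies = [82, 87, 80, 86]

x = np.arange(len(models))  # one center position per model: 0, 1, 2, 3
width = 0.35                # width of each individual bar

fig = plt.figure(figsize=(8, 5))

# Shift each series half a bar width away from the center position
plt.bar(x - width / 2, training_accuracies, width,
        label="Training Accuracy", color="skyblue")
plt.bar(x + width / 2, testing_accuracies, width,
        label="Testing Accuracy", color="salmon")

plt.title("Training vs. Testing Accuracy by Model")
plt.xlabel("Model")
plt.ylabel("Accuracy (%)")
plt.xticks(x, models)  # replace 0, 1, 2, 3 with the model names
plt.legend()
plt.show()
```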

Task 6: Life Expectancy Visualization and Comparison

Data Source: UN WPP (2024); HMD (2024); Zijdeman et al. (2015); Riley (2005), via OurWorldinData.org/life-expectancy

Your task is to analyze life expectancy data from a CSV file, focusing on specific countries, and compare their trends over time.

Step 1: Load and Prepare the Dataset

Load the life expectancy dataset and make it easier to work with.

Instructions:

Load the life-expectancy.csv file into a pandas DataFrame.
Rename the last column to "Life Expectancy" and the "Entity" column to "Country".
Convert the values in the "Country" and "Code" columns to lowercase for easier matching.
Display the first 5 rows of the DataFrame to ensure it's loaded correctly.

Hint: To rename columns, use the rename() method. To convert the column names to lowercase, use .str.lower() on df.columns; to convert the values in a column, use .str.lower() on the column itself.

Example:

# Rename a column
df.rename(columns={'Old Column Name': 'New Column Name'}, inplace=True)

# Convert column names to lowercase
df.columns = df.columns.str.lower()

# Convert data in "country" and "code" columns to lowercase for easier matching

df['country'] = df['country'].str.lower()

df['code'] = df['code'].str.lower()
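
The whole of Step 1 can be sketched end to end as follows. Because the real file is not available here, the snippet reads a tiny in-memory CSV instead; its column names and numbers are made-up placeholders, and the real life-expectancy.csv may name its last column differently, which is why the rename refers to it by position.

```python
import io

import pandas as pd

# Hypothetical stand-in for life-expectancy.csv (made-up values)
csv_text = """Entity,Code,Year,Period life expectancy at birth
France,FRA,1950,66.4
France,FRA,1951,66.1
Japan,JPN,1950,59.2
"""

# With the real file: df = pd.read_csv("life-expectancy.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Rename the last column by position and "Entity" by name
df = df.rename(columns={df.columns[-1]: "Life Expectancy", "Entity": "Country"})

# Lowercase the column names, then the values used for matching
df.columns = df.columns.str.lower()
df["country"] = df["country"].str.lower()
df["code"] = df["code"].str.lower()

print(df.head())
```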

Step 2: Filter by Country or Code

Ask the user to select one or more countries (using their names or codes) for analysis.

Instructions:

Prompt the user to input the country name(s) or code(s), separated by commas.
Match the user input to the "country" or "code" columns in the DataFrame.
Print the matched countries to confirm the selection.

Hint: Use pandas filtering to check if the input matches values in the "country" or "code" columns.

Example:

# Check if a value exists in a column
if "usa" in df['code'].values:
    print("USA found!")
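
One way to handle several comma-separated inputs at once is sketched below. The helper name match_countries and the small DataFrame are hypothetical; the frame only mimics the shape produced by Step 1. A literal string replaces input() so the snippet runs without user interaction.

```python
import pandas as pd

# Made-up frame in the shape produced by Step 1 (lowercase names and values)
df = pd.DataFrame({
    "country": ["france", "france", "japan"],
    "code": ["fra", "fra", "jpn"],
    "year": [1950, 1951, 1950],
})


def match_countries(df, raw):
    """Return sorted country names matching comma-separated names or codes."""
    # Split on commas, trim whitespace, lowercase, and drop empty entries
    terms = [t.strip().lower() for t in raw.split(",") if t.strip()]
    # A row matches if either its country name or its code is in the list
    mask = df["country"].isin(terms) | df["code"].isin(terms)
    return sorted(df.loc[mask, "country"].unique())


# In the task: user_text = input("Enter country names or codes: ")
user_text = "FRA, Japan, atlantis"
selected = match_countries(df, user_text)
print("Matched countries:", selected)
```

Inputs that match nothing (such as "atlantis" here) are silently ignored; you may prefer to report them back to the user.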

Step 3: Identify the Minimum Common Year

Find the earliest year for which all the selected countries have data.

Instructions:

Iterate through the years for each country.
Find the intersection of years across the selected countries.
Print the earliest common year.

Hint: Use Python sets to find the common years, then the min() function to find the minimum year.

Example:

# Find the intersection of two sets
years1 = {2000, 2001, 2002}
years2 = {2001, 2002, 2003}
common_years = years1 & years2
print("Common Years:", common_years)
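
The same idea extends to any number of countries. The sketch below builds one set of years per selected country from a small made-up DataFrame and intersects them all at once with set.intersection().

```python
import pandas as pd

# Made-up data: France has one more early year than Japan
df = pd.DataFrame({
    "country": ["france", "france", "france", "japan", "japan"],
    "year": [1948, 1949, 1950, 1949, 1950],
})
selected = ["france", "japan"]

# One set of years per selected country
year_sets = [set(df.loc[df["country"] == c, "year"]) for c in selected]

# Years present for every selected country
common_years = set.intersection(*year_sets)

min_common_year = min(common_years)
print("Earliest common year:", min_common_year)
```

If the selected countries share no years at all, common_years is empty and min() raises a ValueError, so it is worth checking for that case before calling it.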

Step 4: Filter Data by Year

Filter the DataFrame to only include rows from the minimum common year onward.

Instructions:

Use pandas filtering to subset the DataFrame.
Preview the filtered data.

Hint: To filter data by a condition, use pandas slicing.

Example:

# Filter rows where column values are greater than or equal to a threshold
filtered_df = df[df['year'] >= 1950]
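
In this task the year condition is usually combined with the country selection. A minimal sketch, using made-up values and assuming selected and min_common_year come from the previous steps:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "country": ["france", "france", "france", "japan", "japan", "italy"],
    "year": [1948, 1949, 1950, 1949, 1950, 1950],
    "life expectancy": [65.0, 65.5, 66.4, 58.0, 59.2, 65.8],
})
selected = ["france", "japan"]  # assumed result of Step 2
min_common_year = 1949          # assumed result of Step 3

# Combine conditions with & and wrap each comparison in parentheses
mask = df["country"].isin(selected) & (df["year"] >= min_common_year)
filtered_df = df[mask].reset_index(drop=True)

print(filtered_df.head())
```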

Step 5: Plot the Data

Plot the life expectancy data for the selected countries, either alone or compared to the world average.

Instructions:

Use matplotlib to create the plots:

If plotting alone: Plot one line for each country.
If comparing with the world average: Add an additional line for the world average.

Add titles, axis labels, and a legend.

Hint: Use the plot() function from matplotlib to plot line graphs.

Example:

# Plot a line graph
plt.plot(df['year'], df['life expectancy'], label="Country Name")
plt.legend()
plt.show()
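
To draw one line per country, you can loop over the groups of the filtered DataFrame. The sketch below uses made-up values and assumes the world average appears in the data as rows whose country is "world"; check how it is actually labeled in your file. The compare_with_world flag is a hypothetical name for the user's choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
filtered_df = pd.DataFrame({
    "country": ["france"] * 3 + ["japan"] * 3 + ["world"] * 3,
    "year": [1950, 1960, 1970] * 3,
    "life expectancy": [66.4, 70.2, 72.0, 59.2, 67.7, 72.0, 46.4, 47.8, 56.3],
})
compare_with_world = True  # hypothetical flag for the user's choice

fig = plt.figure(figsize=(10, 6))
for country, group in filtered_df.groupby("country"):
    if country == "world":
        # Draw the world average only when a comparison was requested
        if compare_with_world:
            plt.plot(group["year"], group["life expectancy"],
                     label="World", linestyle="--", color="black")
        continue
    plt.plot(group["year"], group["life expectancy"], label=country.title())

plt.title("Life Expectancy Over Time")
plt.xlabel("Year")
plt.ylabel("Life Expectancy (years)")
plt.legend()
plt.show()
```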

Step 6: Customize the Plot (Optional)

Experiment with customizing the plot by changing colors, adding gridlines, or modifying line styles.

Task:

Change the color of the lines for the countries.
Add a grid to the plot.
Save the plot as a PNG file using plt.savefig("filename.png").

Hint:

Use plt.grid(True) for gridlines.
Use plt.savefig() to save the plot.

Example:

# Add gridlines and save the plot
plt.grid(True)
plt.savefig("life_expectancy_plot.png")

Task 7: Fitting Curves to Mystery Datasets

Objective:

You are provided with three datasets (mystery1.csv, mystery2.csv, mystery3.csv) in the Drive folder. Each .csv contains 2 columns of x and y values. Your task is to:

Load the data and visualize it using an appropriate plot.
Try fitting three types of curves (linear, exponential, sinusoidal) to the data.
Compare the fits using mean squared error (MSE) to decide which curve fits the data best.

Instructions:

Load and Visualize the Data

Use pandas to read the dataset.
Plot the data using a scatter plot to visualize its shape.

Hint: A scatter plot is a great way to see relationships between x and y values.

Define Functions for Curve Fitting

Linear:
Exponential:
Sinusoidal:
Hint: These functions will serve as the mathematical models for curve fitting. Define them using def in Python.
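
The formulas for the three models are not shown above, so the definitions below are assumptions based on common forms: a straight line, a two-parameter exponential, and a sine wave whose three parameters correspond to the amplitude, frequency, and phase guesses used in the fitting step. The parameter names are illustrative. Note that curve_fit expects x as the first argument, followed by the parameters to fit.

```python
import numpy as np


def linear_func(x, a, b):
    # Assumed form: y = a*x + b
    return a * x + b


def exponential_func(x, a, b):
    # Assumed form: y = a * e^(b*x); the intended model may include an offset
    return a * np.exp(b * x)


def sinusoidal_func(x, amplitude, frequency, phase):
    # Assumed form: y = amplitude * sin(frequency*x + phase)
    return amplitude * np.sin(frequency * x + phase)
```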

Fit the Curves

Use scipy.optimize.curve_fit to fit each function to the data.
For exponential and sinusoidal fitting:

Set maxfev=10000 so the optimizer has enough function evaluations to converge (maxfev is the maximum number of function evaluations).

For sinusoidal fitting, provide initial guesses for the amplitude, frequency, and phase. The phase guess can start at 0.

amplitude_guess = (y.max() - y.min()) / 2
frequency_guess = 2 * np.pi / (x.max() - x.min())
phase_guess = 0
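
To see how these guesses feed into curve_fit, here is a minimal sketch on synthetic data (not one of the mystery datasets). The sinusoidal model is an assumed three-parameter form, and the noise level and true parameters are arbitrary.

```python
import numpy as np
from scipy.optimize import curve_fit


def sinusoidal_func(x, amplitude, frequency, phase):
    # Assumed model form; the problem set's own formula may differ
    return amplitude * np.sin(frequency * x + phase)


# Synthetic stand-in data: a noisy sine wave with known parameters
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 * np.sin(0.6 * x + 0.5) + rng.normal(0, 0.1, x.size)

# Initial guesses as described above
amplitude_guess = (y.max() - y.min()) / 2
frequency_guess = 2 * np.pi / (x.max() - x.min())
phase_guess = 0

# p0 passes the guesses in the same order as the function's parameters
popt, _ = curve_fit(
    sinusoidal_func, x, y,
    p0=[amplitude_guess, frequency_guess, phase_guess],
    maxfev=10000,
)

mse = np.mean((y - sinusoidal_func(x, *popt)) ** 2)
print("Fitted parameters:", popt)
print("MSE:", mse)
```

The frequency guess assumes roughly one full cycle across the x range. If your data clearly shows several cycles, multiplying the guess by the approximate number of cycles usually gives the optimizer a better starting point.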

Compare Fits

Plot the original data along with the fitted curves (linear, exponential, sinusoidal).
Calculate and display the mean squared error (MSE) for each fit.

Decide the Best Fit

Based on the MSE and the visual appearance of the fit, decide which curve describes the data best.

If you’re stuck:

# List to store our results for each curve fit attempt
results = []

# Linear fit
popt_linear, _ = curve_fit(linear_func, x, y)
y_linear_fit = linear_func(x, *popt_linear)
mse_linear = np.mean((y - y_linear_fit) ** 2)
results.append(("Linear", y_linear_fit, mse_linear, "red"))

# Exponential fit
popt_exponential, _ = curve_fit(exponential_func, x, y, maxfev=10000)
y_exponential_fit = exponential_func(x, *popt_exponential)
mse_exponential = np.mean((y - y_exponential_fit) ** 2)
results.append(("Exponential", y_exponential_fit, mse_exponential, "green"))

# Sinusoidal fit (after estimating initial parameters)
popt_sinusoidal, _ = curve_fit(
    sinusoidal_func, x, y,
    p0=[amplitude_guess, frequency_guess, phase_guess],
    maxfev=10000,
)
y_sinusoidal_fit = sinusoidal_func(x, *popt_sinusoidal)
mse_sinusoidal = np.mean((y - y_sinusoidal_fit) ** 2)
results.append(("Sinusoidal", y_sinusoidal_fit, mse_sinusoidal, "purple"))

# Plot the original data and best-fit curves
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label="Original Data", color="blue")
for name, fit, mse, color in results:
    plt.plot(x, fit, label=f"{name} Fit (MSE: {mse:.2f})", color=color)
# Title, label, and show the graph