src package¶

Submodules¶

src.albums_database module¶

Create and manipulate a relational database for holding album data.

class src.albums_database.AlbumManager(app=None, engine_string=None)[source]¶

Bases: object

Manages Flask <-> SQLAlchemy connection and adds data to database.

add_album(album: str, artist: str, reviewauthor: str, score: float, releaseyear: int, reviewdate: datetime.datetime, recordlabel: str, genre: str, danceability: float, energy: float, key: float, loudness: float, speechiness: float, acousticness: float, instrumentalness: float, liveness: float, valence: float, tempo: float) → None[source]¶

Seed an existing database with additional albums.

Parameters

album (str) – Album title
artist (str) – Artist
reviewauthor (str) – Name of reviewing author
score (float) – Pitchfork rating
releaseyear (int) – Album release year
reviewdate (datetime.datetime) – Album review date (%B %d %Y)
recordlabel (str) – Album’s record label(s)
genre (str) – Album genre
danceability (float) – Spotify danceability score
energy (float) – Spotify energy score
key (float) – Spotify key score
loudness (float) – Spotify loudness score
speechiness (float) – Spotify speechiness score
acousticness (float) – Spotify acousticness score
instrumentalness (float) – Spotify instrumentalness score
liveness (float) – Spotify liveness score
valence (float) – Spotify valence score
tempo (float) – Spotify tempo score

Returns

None

close() → None[source]¶

Close the current SQLAlchemy session.

Returns: None

ingest_dataset(file_or_path: str) → None[source]¶

Add entries from a CSV file to the database.

Parameters: file_or_path (str) – Location of dataset to load into database
Returns: None
Raises: ValueError –

class src.albums_database.Albums(**kwargs)[source]¶

Bases: sqlalchemy.ext.declarative.api.Base

Create a data model for the database to capture albums.

acousticness¶

album¶

artist¶

danceability¶

energy¶

genre¶

id¶

instrumentalness¶

key¶

liveness¶

loudness¶

recordlabel¶

releaseyear¶

reviewauthor¶

reviewdate¶

score¶

speechiness¶

tempo¶

valence¶

src.albums_database.create_db(engine_string: str) → None[source]¶

Create database from provided engine string.

src.albums_database.delete_db(engine_string: str) → None[source]¶

Delete database from provided engine string.

src.clean module¶

Clean the dataset before modeling.

src.clean.approximate_missing_year(data: pandas.core.frame.DataFrame, fill_column: str = 'releaseyear', approximate_with: str = 'reviewdate') → pandas.core.frame.DataFrame[source]¶

Fill missing values in one column with the year of another datetime column.

Parameters

data (pandas.DataFrame) – DataFrame to clean
fill_column (str) – Name of column to fill values in
approximate_with (str) – Name of datetime column to pull year from

Returns

Cleaned pandas.DataFrame
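
As a rough illustration of the behavior described above, the following sketch fills gaps in one column with the year of a datetime column. The function name approximate_missing_year_sketch and the sample rows are hypothetical, and the real implementation may differ.

```python
import pandas as pd


def approximate_missing_year_sketch(data, fill_column="releaseyear",
                                    approximate_with="reviewdate"):
    """Fill gaps in `fill_column` with the year of `approximate_with`.

    Hypothetical stand-in for src.clean.approximate_missing_year.
    """
    out = data.copy()  # leave the caller's DataFrame untouched
    out[fill_column] = out[fill_column].fillna(out[approximate_with].dt.year)
    return out


# Made-up sample: the second album has no release year.
albums = pd.DataFrame({
    "releaseyear": [2010.0, None],
    "reviewdate": pd.to_datetime(["2011-03-01", "2015-06-20"]),
})
filled = approximate_missing_year_sketch(albums)
```

Only missing entries are touched; a release year that is already present is kept even when it differs from the review year.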

src.clean.bucket_values_together(data: pandas.core.frame.DataFrame, colname: str, values: list, replace_with: list) → pandas.core.frame.DataFrame[source]¶

Replace one or more values with a single value.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str) – Name of column to apply transformation to
values (iterable) – Iterable of values to replace.
replace_with – Value to replace with.

Returns

Cleaned pandas.DataFrame

Raises

TypeError – If values is passed as a single string. Because a string in Python is simply a list of characters, this doesn’t immediately register as bad input, but it logically doesn’t make sense for this method.
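
The string check described under Raises could look something like the sketch below. It assumes the guard targets the values argument and that replace_with is a single replacement value; bucket_values_sketch and the sample genres are hypothetical.

```python
import pandas as pd


def bucket_values_sketch(data, colname, values, replace_with):
    """Replace every entry of `values` in `colname` with `replace_with`.

    Hypothetical stand-in for src.clean.bucket_values_together.
    """
    # A bare string is iterable character by character, so reject it early.
    if isinstance(values, str):
        raise TypeError("`values` must be an iterable of values, not a string")
    out = data.copy()
    out[colname] = out[colname].replace(list(values), replace_with)
    return out


albums = pd.DataFrame({"genre": ["Pop/R&B", "Rap", "Rock"]})
bucketed = bucket_values_sketch(albums, "genre", ["Pop/R&B", "Rap"], "Other")
```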

src.clean.clean_dataset(data: pandas.core.frame.DataFrame, config) → pandas.core.frame.DataFrame[source]¶

Perform full data processing pipeline.

Parameters

data (pandas.DataFrame) – Raw data
config (dict) – Config file as read in by PyYAML

Returns

pandas.DataFrame of cleaned data

src.clean.convert_datetime_to_date(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate') → pandas.core.frame.DataFrame[source]¶

Remove the time component of a datetime column.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.

Returns

Cleaned pandas.DataFrame

src.clean.convert_str_to_datetime(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate', datetime_format: str = '%B %d %Y') → pandas.core.frame.DataFrame[source]¶

Parse a string column to datetime format.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.
datetime_format (str, optional) – Datetime format of column. Defaults to “%B %d %Y”. For more info on these codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes.

Returns

Cleaned pandas.DataFrame
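
To make the default format codes concrete, here is a small standard-library sketch that parses one date string with "%B %d %Y" and then drops the time component. The sample string is made up; the module itself works on whole DataFrame columns rather than single values.

```python
from datetime import datetime

# %B = full month name, %d = day of month, %Y = four-digit year
review_format = "%B %d %Y"

# Made-up sample value in the documented format.
parsed = datetime.strptime("September 5 2016", review_format)

# Dropping the time component leaves a plain date.
review_date = parsed.date()
```

Month names are matched according to the active locale, so this assumes the default English names.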

src.clean.fill_missing_manually(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel', fill_with: tuple = ("Fool's Gold", 'Vapor', '101 Distribution', 'Jet Life', 'Espo', 'Cinematic', 'Def Jam', 'LM Dupli-Cation', 'Glory Boyz', 'Epic', 'Self-released', 'Cash Money', 'Grand Hustle', 'Vice', 'Free Bandz', 'Six Shooter Records', 'Self-released', 'Maybach', 'Self-released', 'Top Dawg', 'Triple X', '1017', 'Rostrum', 'BasedWorld', '10.Deep', 'Self-released')) → pandas.core.frame.DataFrame[source]¶

Manually fill missing values.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.
fill_with (iterable) – Corrected values to replace missing values with. Data type depends on the column being filled.

Returns

Cleaned pandas.DataFrame

src.clean.fill_na_with_str(data: pandas.core.frame.DataFrame, colname: str = 'genre', fill_string: str = 'Missing') → pandas.core.frame.DataFrame[source]¶

Fill NA values with a string value.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “genre”.
fill_string (str, optional) – String to replace missing values with. Defaults to “Missing”.

Returns

Cleaned pandas.DataFrame

src.clean.strip_whitespace(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel') → pandas.core.frame.DataFrame[source]¶

Trim extra whitespace from values in a column.

Parameters

data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.

Returns

Cleaned pandas.DataFrame

src.evaluate_performance module¶

Evaluate the performance of a model through its predictions.

src.evaluate_performance.evaluate_model(results_data: pandas.core.frame.DataFrame, y_true_colname: str, y_pred_colname: str) → pandas.core.frame.DataFrame[source]¶

Evaluate performance against a variety of regression metrics:

MSE
RMSE
MAD
R-squared
Max error

Parameters

results_data (pandas.DataFrame) – DataFrame containing (at least) predicted and ground truth values
y_true_colname (str) – Name of column containing true values
y_pred_colname (str) – Name of column containing predicted values

Returns

pandas.DataFrame containing metrics and values
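
The listed metrics can be written out in plain Python as in the sketch below. It assumes that MAD means the mean absolute deviation of the prediction errors, which the documentation does not spell out; regression_metrics and the sample values are hypothetical, and the real function returns a DataFrame rather than a dict.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute the listed regression metrics for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # Assumption: MAD = mean absolute deviation of the errors.
        "MAD": sum(abs(e) for e in errors) / n,
        "R-squared": 1 - ss_res / ss_tot,
        "Max error": max(abs(e) for e in errors),
    }


metrics = regression_metrics([7.0, 8.0, 9.0], [7.0, 7.0, 10.0])
```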

src.load_data module¶

Move data between a local filesystem and S3 bucket.

src.load_data.download_file_from_s3(local_path: str, s3path: str) → None[source]¶

Download a file from S3.

Parameters

local_path (str) – Destination file or path on local machine
s3path (str) – File or path to download from S3

Returns

None

src.load_data.download_from_s3_pandas(local_path: str, s3path: str, sep: str = ',') → None[source]¶

Download a pandas.DataFrame from S3.

Parameters

local_path (str) – Destination file or path on local machine
s3path (str) – File or path to download from S3
sep (str, optional) – Field separator in S3 file. Defaults to “,”.

Returns

None

src.load_data.download_raw_data(local_destination: str) → None[source]¶

Download the original dataset from source.

Parameters: local_destination (str) – Destination file or path on local machine
Returns: None

src.load_data.parse_s3(s3path: str) → Tuple[str, str][source]¶

Split an S3 filepath into the bucket name and subsequent path.

Parameters: s3path (str) – File path in S3 (including “s3://” prefix)
Returns: Tuple containing S3 bucket name and S3 path
Return type: tuple(str, str)
Raises: ValueError – If s3path is not of the form “s3://bucket/path”
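
One way to perform this kind of split is a regular expression over the path, as sketched below. The name parse_s3_sketch, the exact pattern, and the example bucket are assumptions for illustration; the actual implementation may validate paths differently.

```python
import re
from typing import Tuple


def parse_s3_sketch(s3path: str) -> Tuple[str, str]:
    """Split "s3://bucket/path" into (bucket, path).

    Hypothetical stand-in for src.load_data.parse_s3.
    """
    match = re.match(r"^s3://([^/]+)/(.+)$", s3path)
    if match is None:
        raise ValueError(f"Expected a path like s3://bucket/path, got {s3path!r}")
    return match.group(1), match.group(2)


bucket, key = parse_s3_sketch("s3://example-bucket/data/albums.csv")
```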

src.load_data.upload_file_to_s3(local_path: str, s3path: str) → None[source]¶

Upload a local file to S3.

Parameters

local_path (str) – File name or path to local file to upload.
s3path (str) – Destination path in S3.

Returns

None

src.load_data.upload_to_s3_pandas(local_path: str, s3path: str, sep: str = ',') → None[source]¶

Upload a pandas.DataFrame to S3.

Parameters

local_path (str) – File name or path to local file to upload.
s3path (str) – Destination path in S3.
sep (str, optional) – Field separator. Defaults to “,”.

Returns

None

src.model module¶

Build, fit, and evaluate predictive models.

src.model.make_model(**kwargs) → sklearn.base.BaseEstimator[source]¶

Create an untrained GBT model for use in a sklearn.pipeline.Pipeline.

Parameters: **kwargs – Parameters to pass on to GBT constructor
Returns: Untrained sklearn.ensemble.GradientBoostingRegressor object

src.model.make_preprocessor(numeric_features: List[str], categorical_features: List[str], handle_unknown: str) → sklearn.compose._column_transformer.ColumnTransformer[source]¶

Define preprocessing steps for input features.

Performs standard scaling for numeric features and one-hot encoding for categorical features. All features specified for this function are processed, and _only_ these features are used when modeling. In other words, this preprocessor determines the exact input columns (and order) when training and performing inference.

Parameters

numeric_features (list(str)) – Names of numeric features to scale
categorical_features (list(str)) – Names of categorical features to one-hot encode
handle_unknown (str) – Policy for unknown categories in OneHotEncoder (either “ignore” or “error”)

Returns

A sklearn.compose.ColumnTransformer with the desired transformation steps
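
A preprocessor matching this description might be assembled roughly as follows. The step names "numeric" and "categorical", the function name make_preprocessor_sketch, and the sample data are assumptions; only the scikit-learn classes named in the description are taken from the documentation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_preprocessor_sketch(numeric_features, categorical_features,
                             handle_unknown="ignore"):
    """Scale numeric columns and one-hot encode categorical columns."""
    return ColumnTransformer(
        transformers=[
            ("numeric", StandardScaler(), numeric_features),
            ("categorical", OneHotEncoder(handle_unknown=handle_unknown),
             categorical_features),
        ],
        remainder="drop",  # columns not listed above are not passed on
    )


albums = pd.DataFrame({
    "tempo": [100.0, 120.0, 140.0],
    "genre": ["Rock", "Rap", "Rock"],
    "artist": ["a", "b", "c"],  # not listed, so it is dropped
})
preprocessor = make_preprocessor_sketch(["tempo"], ["genre"])
transformed = preprocessor.fit_transform(albums)

# With handle_unknown="ignore", an unseen genre does not raise at inference.
unseen = preprocessor.transform(
    pd.DataFrame({"tempo": [120.0], "genre": ["Jazz"], "artist": ["d"]})
)
```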

src.model.parse_dict_to_dataframe(form_dict: dict) → pandas.core.frame.DataFrame[source]¶

Parse a dictionary to pandas.DataFrame format.

Flask forms supply data via POST requests in MultiDict format, but the model pipeline requires an input DataFrame.

Parameters

form_dict (dict) – Flask form response as a flat dictionary

Returns

pandas.DataFrame with keys as column names and values as the associated values for each key
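
For a flat dictionary, the conversion can be as small as the sketch below, which builds a one-row DataFrame. The function name and form fields are hypothetical, and the real function may do additional type handling.

```python
import pandas as pd


def parse_dict_to_dataframe_sketch(form_dict):
    """Turn a flat dict into a one-row DataFrame (keys become columns)."""
    return pd.DataFrame([form_dict])


# Made-up form payload; HTML form values typically arrive as strings.
form = {"artist": "Example Artist", "genre": "Rock", "tempo": "120.5"}
frame = parse_dict_to_dataframe_sketch(form)
```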

src.model.split_predictors_response(data: pandas.core.frame.DataFrame, target_col: str = 'score') → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶

Separate predictor variables from response variable.

src.model.split_train_val_test(features: pandas.core.frame.DataFrame, target: list, train_val_test_ratio: str, **kwargs) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶

Partition dataset into training, validation, and testing splits.

Parameters

features (pandas.DataFrame) – DataFrame of input features
target (array-like) – Values of response variable to predict
train_val_test_ratio (str) – Relative proportion of data for each of train, val, and test sets, in the form “X:Y:Z” (e.g., “6:2:2”).
**kwargs – Additional settings to pass on to train_test_split() (for example, random seed)

Returns

(X_train, X_val, X_test, y_train, y_val, y_test), each as DataFrames. X_val and y_val are omitted if the desired ratio does not specify the size of a validation set.
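
The “X:Y:Z” ratio has to be turned into split sizes at some point. The sketch below shows one plausible way to do that for two successive calls to train_test_split(). The helper names, the two-stage approach, and the treatment of a zero validation share as “no validation set” are all assumptions, not documented behavior.

```python
def ratio_to_fractions(train_val_test_ratio):
    """Convert "X:Y:Z" into fractions that sum to 1."""
    parts = [float(p) for p in train_val_test_ratio.split(":")]
    if len(parts) != 3 or any(p < 0 for p in parts) or sum(parts) == 0:
        raise ValueError(f"Expected a ratio like '6:2:2', got {train_val_test_ratio!r}")
    total = sum(parts)
    return [p / total for p in parts]


def two_stage_sizes(train_val_test_ratio):
    """Return (test_size, val_size) for two successive splits.

    test_size is a share of the full data; val_size is a share of what
    remains after the test set is removed (None if no validation set).
    """
    train, val, test = ratio_to_fractions(train_val_test_ratio)
    val_size = val / (train + val) if val > 0 else None
    return test, val_size


test_size, val_size = two_stage_sizes("6:2:2")
```

With "6:2:2", the first split would hold out about 20% for testing, and the second would take about a quarter of the remainder for validation.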

src.model.train_pipeline(X_train: pandas.core.frame.DataFrame, y_train: list, preprocessor: sklearn.compose._column_transformer.ColumnTransformer, model: sklearn.base.BaseEstimator) → sklearn.pipeline.Pipeline[source]¶

Create and fit a preprocessing –> modeling pipeline.

Parameters

X_train (pandas.DataFrame) – Training features
y_train (array-like) – Training targets
preprocessor (sklearn.compose.ColumnTransformer) – ColumnTransformer defining the processing to perform for input data
model (sklearn.base.BaseEstimator) – An untrained sklearn regression model

Returns

A fitted sklearn.pipeline.Pipeline
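
A preprocessing and modeling pipeline of this shape can be sketched as below. The step names "preprocessor" and "model", the function name train_pipeline_sketch, and the tiny made-up training set are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def train_pipeline_sketch(X_train, y_train, preprocessor, model):
    """Chain the preprocessor and model, then fit them together."""
    pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
    pipeline.fit(X_train, y_train)
    return pipeline


# Made-up training data.
X_train = pd.DataFrame({
    "tempo": [100.0, 120.0, 140.0, 90.0, 110.0, 130.0],
    "genre": ["Rock", "Rap", "Rock", "Rap", "Rock", "Rap"],
})
y_train = [7.5, 6.0, 8.2, 5.5, 7.0, 6.8]

preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), ["tempo"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
])
model = GradientBoostingRegressor(n_estimators=10, random_state=0)

fitted = train_pipeline_sketch(X_train, y_train, preprocessor, model)
predictions = fitted.predict(X_train)
```

Fitting the two steps as one object means the same scaling and encoding learned during training are applied automatically at inference time.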

src.model.validate_dataframe(data: pandas.core.frame.DataFrame, output_cols: List = ['artist', 'album', 'reviewauthor', 'releaseyear', 'reviewdate', 'recordlabel', 'genre', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']) → pandas.core.frame.DataFrame[source]¶

Align a DataFrame with model pipeline’s required order and names.

The model pipeline requires an input DataFrame with exactly the same columns as seen during training, and in the same order. Creates the columns that don’t exist (filling with NA).

Parameters

data (pandas.DataFrame) – Input DataFrame to validate/align
output_cols (list(str), optional) – Required columns for output DataFrame. Defaults to those seen during training. If not provided (None), no adjustment to the DataFrame’s columns is made.

Returns

Validated pandas.DataFrame
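
The alignment described here resembles what DataFrame.reindex does with a columns argument, as sketched below. Whether the real function uses reindex is an assumption, and validate_dataframe_sketch and the sample columns are hypothetical.

```python
import pandas as pd


def validate_dataframe_sketch(data, output_cols=None):
    """Return `data` with exactly `output_cols`, in that order.

    Missing columns are created and filled with NA; extra columns are dropped.
    """
    if output_cols is None:
        return data  # no adjustment requested
    return data.reindex(columns=output_cols)


raw = pd.DataFrame({"genre": ["Rock"], "artist": ["Example Artist"], "extra": [1]})
aligned = validate_dataframe_sketch(raw, ["artist", "album", "genre"])
```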

src.post_process module¶

Analyze a trained model.

src.post_process.get_feature_importance(trained_pipeline: sklearn.pipeline.Pipeline, numeric_features: List[str]) → pandas.core.series.Series[source]¶

Get feature importance measures from a trained model.

Parameters

trained_pipeline (sklearn.pipeline.Pipeline) – Fitted model pipeline
numeric_features (list(str)) – Names of numeric features

Returns

pandas.Series containing each feature and its importance

src.score_model module¶

Generate new values given a trained model and some new input.

src.score_model.append_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame, output_col: str = 'preds') → pandas.core.frame.DataFrame[source]¶

Append predictions to an existing input DataFrame.

Parameters

trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline
input_data (pandas.DataFrame) – Input data to predict on
output_col (str, optional) – Name of column to place predicted values in. Defaults to “preds”.

Returns

Input pandas.DataFrame with predictions appended as a new column

src.score_model.get_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame) → list[source]¶

Get predicted values for input data.

Parameters

trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline
input_data (pandas.DataFrame) – Input data to predict on

Returns

array-like of predicted values

src.serialize module¶

Serialize and deserialize trained model pipelines.

src.serialize.load_pipeline(load_path: str) → sklearn.pipeline.Pipeline[source]¶

Deserialize a fitted model pipeline.

Parameters: load_path (str) – Path to joblib-saved pipeline
Returns: Fitted sklearn.pipeline.Pipeline object

src.serialize.save_pipeline(pipeline: sklearn.pipeline.Pipeline, save_path: str) → None[source]¶

Serialize a fitted model pipeline.

Parameters

pipeline (sklearn.pipeline.Pipeline) – Fitted model pipeline
save_path (str) – Where to save the pipeline

Returns

None
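
Since load_pipeline expects a joblib-saved pipeline, a save and load round trip presumably resembles the sketch below. The file name, the temporary directory, and the one-step pipeline are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A deliberately tiny fitted pipeline (mean 1.0, standard deviation 1.0).
pipeline = Pipeline(steps=[("scale", StandardScaler())]).fit([[0.0], [2.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "pipeline.joblib")
    joblib.dump(pipeline, save_path)    # serialize to disk
    restored = joblib.load(save_path)   # deserialize a fitted copy

# The restored pipeline carries the fitted state with it.
centered = restored.transform([[1.0]])
```

As with any pickle-based format, joblib files should only be loaded from trusted sources, and ideally with the same library versions used to save them.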
