src package

Submodules

src.albums_database module

Create and manipulate a relational database for holding album data.

class src.albums_database.AlbumManager(app=None, engine_string=None)

Bases: object

Manages Flask <-> SQLAlchemy connection and adds data to database.

add_album(album: str, artist: str, reviewauthor: str, score: float, releaseyear: int, reviewdate: datetime.datetime, recordlabel: str, genre: str, danceability: float, energy: float, key: float, loudness: float, speechiness: float, acousticness: float, instrumentalness: float, liveness: float, valence: float, tempo: float) -> None

Seed an existing database with additional albums.

Parameters
  • album (str) – Album title

  • artist (str) – Artist

  • reviewauthor (str) – Name of reviewing author

  • score (float) – Pitchfork rating

  • releaseyear (int) – Album release year

  • reviewdate (str) – Album review date (%B %d %Y)

  • recordlabel (str) – Album’s record label(s)

  • genre (str) – Album genre

  • danceability (float) – Spotify danceability score

  • energy (float) – Spotify energy score

  • key (float) – Spotify key score

  • loudness (float) – Spotify loudness score

  • speechiness (float) – Spotify speechiness score

  • acousticness (float) – Spotify acousticness score

  • instrumentalness (float) – Spotify instrumentalness score

  • liveness (float) – Spotify liveness score

  • valence (float) – Spotify valence score

  • tempo (float) – Spotify tempo score

Returns

None

close() -> None

Close the current SQLAlchemy session.

Returns

None

ingest_dataset(file_or_path: str) -> None

Add entries from a CSV file to the database.

Parameters

file_or_path (str) – Location of dataset to load into database

Returns

None

Raises

ValueError

class src.albums_database.Albums(**kwargs)

Bases: sqlalchemy.ext.declarative.api.Base

Create a data model for the database to capture albums.

acousticness
album
artist
danceability
energy
genre
id
instrumentalness
key
liveness
loudness
recordlabel
releaseyear
reviewauthor
reviewdate
score
speechiness
tempo
valence
src.albums_database.create_db(engine_string: str) -> None

Create database from provided engine string.

src.albums_database.delete_db(engine_string: str) -> None

Delete database from provided engine string.

src.clean module

Clean the dataset before modeling.

src.clean.approximate_missing_year(data: pandas.core.frame.DataFrame, fill_column: str = 'releaseyear', approximate_with: str = 'reviewdate') -> pandas.core.frame.DataFrame

Fill missing values in one column with the year of another datetime column.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • fill_column (str) – Name of column to fill values in

  • approximate_with (str) – Name of datetime column to pull year from

Returns

Cleaned pandas.DataFrame
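
The fill described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function name approximate_missing_year_sketch and the sample data are hypothetical, and it assumes the source column already has a datetime dtype.

```python
import pandas as pd


def approximate_missing_year_sketch(data, fill_column="releaseyear",
                                    approximate_with="reviewdate"):
    """Fill missing years with the year of a datetime column (illustrative only)."""
    out = data.copy()  # leave the caller's DataFrame untouched
    # Only missing entries are replaced; existing years are kept as-is.
    out[fill_column] = out[fill_column].fillna(out[approximate_with].dt.year)
    return out


# Hypothetical sample data: the second album has no release year.
albums = pd.DataFrame({
    "releaseyear": [2015.0, None],
    "reviewdate": pd.to_datetime(["2015-06-01", "2017-03-14"]),
})
filled = approximate_missing_year_sketch(albums)
```

The review year is only an approximation of the release year, so this kind of fill is best reserved for rows where no better source is available.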

src.clean.bucket_values_together(data: pandas.core.frame.DataFrame, colname: str, values: list, replace_with: list) -> pandas.core.frame.DataFrame

Replace one or more values with a single value.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str) – Name of column to apply transformation to

  • values (iterable) – Iterable of values to replace.

  • replace_with – Value to replace with.

Returns

Cleaned pandas.DataFrame

Raises

TypeError – If values is a string. Because a Python string is itself an iterable of characters, it does not immediately register as bad input, but it does not make logical sense for this method.
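
A rough sketch of the bucketing behavior, including a guard against passing a bare string. The function name bucket_values_sketch, the sample genres, and the use of a single scalar replacement value are assumptions made for illustration; the real function may differ.

```python
import pandas as pd


def bucket_values_sketch(data, colname, values, replace_with):
    """Replace every entry of `values` in one column with `replace_with`."""
    # A string is iterable character by character, so reject it explicitly.
    if isinstance(values, str):
        raise TypeError("`values` must be an iterable of values, not a single string")
    out = data.copy()
    matches = out[colname].isin(list(values))
    out[colname] = out[colname].mask(matches, replace_with)
    return out


# Hypothetical sample data.
genres = pd.DataFrame({"genre": ["Rap", "Hip-Hop", "Rock"]})
bucketed = bucket_values_sketch(genres, "genre", ["Rap", "Hip-Hop"], "Rap/Hip-Hop")
```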

src.clean.clean_dataset(data: pandas.core.frame.DataFrame, config) -> pandas.core.frame.DataFrame

Perform full data processing pipeline.

Parameters
  • data (pandas.DataFrame) – Raw data

  • config (dict) – Config file as read in by PyYAML

Returns

pandas.DataFrame of cleaned data

src.clean.convert_datetime_to_date(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate') -> pandas.core.frame.DataFrame

Remove the time component of a datetime column.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.

Returns

Cleaned pandas.DataFrame

src.clean.convert_str_to_datetime(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate', datetime_format: str = '%B %d %Y') -> pandas.core.frame.DataFrame

Parse a string column to datetime format.

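Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.

  • datetime_format (str, optional) – Format string used to parse the column. Defaults to “%B %d %Y”.

Returns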

Cleaned pandas.DataFrame
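
To make the default format concrete, here is a small sketch of parsing strings such as "March 14 2016" with pandas.to_datetime. The function name convert_str_to_datetime_sketch and the sample dates are hypothetical; whether the package uses pandas.to_datetime internally is an assumption.

```python
import pandas as pd


def convert_str_to_datetime_sketch(data, colname="reviewdate",
                                   datetime_format="%B %d %Y"):
    """Parse a string column into datetimes using an explicit format."""
    out = data.copy()
    # %B = full month name, %d = day of month, %Y = four-digit year
    out[colname] = pd.to_datetime(out[colname], format=datetime_format)
    return out


# Hypothetical sample data.
reviews = pd.DataFrame({"reviewdate": ["March 14 2016", "January 05 2017"]})
parsed = convert_str_to_datetime_sketch(reviews)
```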

src.clean.fill_missing_manually(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel', fill_with: tuple = ("Fool's Gold", 'Vapor', '101 Distribution', 'Jet Life', 'Espo', 'Cinematic', 'Def Jam', 'LM Dupli-Cation', 'Glory Boyz', 'Epic', 'Self-released', 'Cash Money', 'Grand Hustle', 'Vice', 'Free Bandz', 'Six Shooter Records', 'Self-released', 'Maybach', 'Self-released', 'Top Dawg', 'Triple X', '1017', 'Rostrum', 'BasedWorld', '10.Deep', 'Self-released')) -> pandas.core.frame.DataFrame

Manually fill missing values.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.

  • fill_with (iterable) – Corrected values to replace missing values with. Data type depends on the column being filled.

Returns

Cleaned pandas.DataFrame

src.clean.fill_na_with_str(data: pandas.core.frame.DataFrame, colname: str = 'genre', fill_string: str = 'Missing') -> pandas.core.frame.DataFrame

Fill NA values with a string value.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str, optional) – Name of column to apply transformation to. Defaults to “genre”.

  • fill_string (str, optional) – String to replace missing values with. Defaults to “Missing”.

Returns

Cleaned pandas.DataFrame

src.clean.strip_whitespace(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel') -> pandas.core.frame.DataFrame

Trim extra whitespace from values in a column.

Parameters
  • data (pandas.DataFrame) – DataFrame to clean

  • colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.

Returns

Cleaned pandas.DataFrame

src.evaluate_performance module

Evaluate the performance of a model through its predictions.

src.evaluate_performance.evaluate_model(results_data: pandas.core.frame.DataFrame, y_true_colname: str, y_pred_colname: str) -> pandas.core.frame.DataFrame

Evaluate performance against a variety of regression metrics:

  • MSE

  • RMSE

  • MAD

  • R-squared

  • Max error

Parameters
  • results_data (pandas.DataFrame) – DataFrame containing (at least) predicted and ground truth values

  • y_true_colname (str) – Name of column containing true values

  • y_pred_colname (str) – Name of column containing predicted values

Returns

pandas.DataFrame containing metrics and values
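
The listed metrics can be computed along these lines. This is a sketch under stated assumptions: MAD is interpreted here as the mean absolute deviation between true and predicted values (the source does not spell it out), and the function name evaluate_model_sketch, the output column names, and the sample scores are hypothetical.

```python
import math

import pandas as pd


def evaluate_model_sketch(results_data, y_true_colname, y_pred_colname):
    """Compute a handful of regression metrics from a results DataFrame."""
    y_true = results_data[y_true_colname].astype(float)
    y_pred = results_data[y_pred_colname].astype(float)
    errors = y_true - y_pred

    mse = float((errors ** 2).mean())
    ss_res = float((errors ** 2).sum())                    # residual sum of squares
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())  # total sum of squares

    metrics = {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAD": float(errors.abs().mean()),  # assumption: mean absolute deviation
        "R-squared": 1.0 - ss_res / ss_tot,
        "Max error": float(errors.abs().max()),
    }
    return pd.DataFrame({"metric": list(metrics), "value": list(metrics.values())})


# Hypothetical sample results.
results = pd.DataFrame({"score": [6.0, 8.0, 7.0], "preds": [6.5, 7.0, 7.0]})
report = evaluate_model_sketch(results, "score", "preds")
```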

src.load_data module

Move data between a local filesystem and S3 bucket.

Copyright 2020, Chloe Mawer

src.load_data.download_file_from_s3(local_path: str, s3path: str) -> None

Download a file from S3.

Parameters
  • local_path (str) – Destination file or path on local machine

  • s3path (str) – File or path to download from S3

Returns

None

src.load_data.download_from_s3_pandas(local_path: str, s3path: str, sep: str = ',') -> None

Download a pandas.DataFrame from S3.

Parameters
  • local_path (str) – Destination file or path on local machine

  • s3path (str) – File or path to download from S3

  • sep (str, optional) – Field separator in S3 file. Defaults to “,”.

Returns

None

src.load_data.download_raw_data(local_destination: str) -> None

Download the original dataset from source.

Parameters

local_destination (str) – Destination file or path on local machine

Returns

None

src.load_data.parse_s3(s3path: str) -> Tuple[str, str]

Split an S3 filepath into the bucket name and subsequent path.

Parameters

s3path (str) – File path in S3 (including “s3://” prefix)

Returns

Tuple containing S3 bucket name and S3 path

Return type

tuple(str, str)

Raises

ValueError – If s3path is not of the form “s3://bucket/path”
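
One way to split an S3 path into its bucket and key is a single regular expression. The sketch below is illustrative only: the function name parse_s3_sketch, the exact pattern, and the example bucket are assumptions, not the package's code.

```python
import re
from typing import Tuple


def parse_s3_sketch(s3path: str) -> Tuple[str, str]:
    """Split 's3://bucket/path/to/file' into ('bucket', 'path/to/file')."""
    match = re.match(r"^s3://([^/]+)/(.+)$", s3path)
    if match is None:
        raise ValueError(
            f"Expected a path of the form 's3://bucket/path', got {s3path!r}"
        )
    return match.group(1), match.group(2)


# Hypothetical bucket and key.
bucket, key = parse_s3_sketch("s3://my-bucket/data/albums.csv")
```

Failing fast on a malformed path keeps the error close to the caller rather than surfacing later as an opaque S3 client error.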

src.load_data.upload_file_to_s3(local_path: str, s3path: str) -> None

Upload a local file to S3.

Parameters
  • local_path (str) – File name or path to local file to upload.

  • s3path (str) – Destination path in S3.

Returns

None

src.load_data.upload_to_s3_pandas(local_path: str, s3path: str, sep: str = ',') -> None

Upload a pandas.DataFrame to S3.

Parameters
  • local_path (str) – File name or path to local file to upload.

  • s3path (str) – Destination path in S3.

  • sep (str, optional) – Field separator. Defaults to “,”.

Returns

None

src.model module

Build, fit, and evaluate predictive models.

src.model.make_model(**kwargs) -> sklearn.base.BaseEstimator

Create an untrained GBT model for use in a sklearn.pipeline.Pipeline.

Parameters

**kwargs – Parameters to pass on to GBT constructor

Returns

Untrained sklearn.ensemble.GradientBoostingRegressor object

src.model.make_preprocessor(numeric_features: List[str], categorical_features: List[str], handle_unknown: str) -> sklearn.compose._column_transformer.ColumnTransformer

Define preprocessing steps for input features.

Performs standard scaling for numeric features and one-hot encoding for categorical features. All features specified for this function are processed, and _only_ these features are used when modeling. In other words, this preprocessor determines the exact input columns (and order) when training and performing inference.

Parameters
  • numeric_features (list(str)) – Names of numeric features to scale

  • categorical_features (list(str)) – Names of categorical features to one-hot encode

  • handle_unknown (str) – Policy for unknown categories in OneHotEncoder (either “ignore” or “error”)

Returns

A sklearn.compose.ColumnTransformer with the desired transformation steps

src.model.parse_dict_to_dataframe(form_dict: dict) -> pandas.core.frame.DataFrame

Parse a dictionary to pandas.DataFrame format.

Flask forms supply data via POST requests in MultiDict format, but the model pipeline requires an input DataFrame.

Parameters

form_dict (dict) – Flask form response as a flat dictionary

Returns

pandas.DataFrame with keys as column names and values as the associated values for each key
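
For a flat dictionary, the conversion amounts to building a one-row DataFrame. The sketch below assumes exactly that; the function name parse_dict_to_dataframe_sketch and the form fields are hypothetical, and any type conversion of form values (which arrive as strings) is left out.

```python
import pandas as pd


def parse_dict_to_dataframe_sketch(form_dict):
    """Turn a flat dict into a single-row DataFrame, one column per key."""
    # Wrapping the dict in a list yields one row rather than one row per value.
    return pd.DataFrame([form_dict])


# Hypothetical form response.
form = {"artist": "Example Artist", "genre": "Rock", "tempo": "120.5"}
frame = parse_dict_to_dataframe_sketch(form)
```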

src.model.split_predictors_response(data: pandas.core.frame.DataFrame, target_col: str = 'score') -> Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Separate predictor variables from response variable.

src.model.split_train_val_test(features: pandas.core.frame.DataFrame, target: list, train_val_test_ratio: str, **kwargs) -> Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Partition dataset into training, validation, and testing splits.

Parameters
  • features (pandas.DataFrame) – DataFrame of input features

  • target (array-like) – Values of response variable to predict

  • train_val_test_ratio (str) – Relative proportion of data for each of train, val, and test sets, in the form “X:Y:Z” (e.g., “6:2:2”).

  • **kwargs – Additional settings to pass on to train_test_split() (for example, random seed)

Returns

(X_train, X_val, X_test, y_train, y_val, y_test), each as DataFrames.

X_val and y_val are omitted if the desired ratio does not specify the size of a validation set.
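
A ratio string such as "6:2:2" can be turned into two successive calls to train_test_split. The sketch below is one possible approach, not the package's code: the function name split_train_val_test_sketch is hypothetical, and treating a two-part ratio like "8:2" as "no validation set" is an assumption about how the omission is specified.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test_sketch(features, target, train_val_test_ratio, **kwargs):
    """Split data according to a 'train:val:test' (or 'train:test') ratio string."""
    parts = [float(p) for p in train_val_test_ratio.split(":")]
    if len(parts) not in (2, 3) or min(parts) <= 0:
        raise ValueError("Ratio must look like 'X:Y:Z' or 'X:Z' with positive numbers")

    # First split: carve off the training share from everything else.
    train_share = parts[0] / sum(parts)
    X_train, X_rest, y_train, y_rest = train_test_split(
        features, target, train_size=train_share, **kwargs
    )
    if len(parts) == 2:  # assumption: two parts means no validation set
        return X_train, X_rest, y_train, y_rest

    # Second split: divide the remainder between validation and test.
    val_share = parts[1] / (parts[1] + parts[2])
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=val_share, **kwargs
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Hypothetical data: ten rows split 6:2:2.
features = pd.DataFrame({"tempo": [float(i) for i in range(10)]})
target = [float(i) for i in range(10)]
X_train, X_val, X_test, y_train, y_val, y_test = split_train_val_test_sketch(
    features, target, "6:2:2", random_state=0
)
```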

src.model.train_pipeline(X_train: pandas.core.frame.DataFrame, y_train: list, preprocessor: sklearn.compose._column_transformer.ColumnTransformer, model: sklearn.base.BaseEstimator) -> sklearn.pipeline.Pipeline

Create and fit a preprocessing –> modeling pipeline.

Parameters
  • X_train (pandas.DataFrame) – Training features

  • y_train (array-like) – Training targets

  • preprocessor (sklearn.compose.ColumnTransformer) – ColumnTransformer defining the processing to perform for input data

  • model (sklearn.base.BaseEstimator) – An untrained sklearn regression model

Returns

A fitted sklearn.pipeline.Pipeline
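
Chaining the preprocessor and the model could look like the following. The step names "preprocessor" and "model", the function name train_pipeline_sketch, and the tiny training set are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def train_pipeline_sketch(X_train, y_train, preprocessor, model):
    """Combine preprocessing and modeling into one fitted Pipeline."""
    pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
    pipeline.fit(X_train, y_train)
    return pipeline


# Hypothetical training data.
X = pd.DataFrame({
    "tempo": [90.0, 120.0, 150.0, 100.0, 130.0, 110.0],
    "genre": ["Rock", "Rap", "Rock", "Jazz", "Rap", "Jazz"],
})
y = [6.5, 7.2, 8.0, 5.9, 7.5, 6.1]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tempo"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
])
model = GradientBoostingRegressor(n_estimators=10, random_state=0)

fitted = train_pipeline_sketch(X, y, preprocessor, model)
predictions = fitted.predict(X)
```

Keeping both steps in a single Pipeline means the same scaling and encoding learned during training are applied automatically at prediction time.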

src.model.validate_dataframe(data: pandas.core.frame.DataFrame, output_cols: List = ['artist', 'album', 'reviewauthor', 'releaseyear', 'reviewdate', 'recordlabel', 'genre', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']) -> pandas.core.frame.DataFrame

Align a DataFrame with model pipeline’s required order and names.

The model pipeline requires an input DataFrame with exactly the same columns as seen during training, and in the same order. Creates the columns that don’t exist (filling with NA).

Parameters
  • data (pandas.DataFrame) – Input DataFrame to validate/align

  • output_cols (list(str), optional) – Required columns for output DataFrame. Defaults to those seen during training. If not provided (None), no adjustment to the DataFrame’s columns is made.

Returns

Validated pandas.DataFrame
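
The alignment described above maps naturally onto DataFrame.reindex. This is a sketch under that assumption; the function name validate_dataframe_sketch and the sample columns are hypothetical, and dropping columns that are not in output_cols is a side effect of reindex rather than something the source states.

```python
import pandas as pd


def validate_dataframe_sketch(data, output_cols=None):
    """Reorder columns to `output_cols`, creating missing ones filled with NA."""
    if output_cols is None:
        return data  # no adjustment requested
    # reindex adds absent columns as NaN and enforces the requested order.
    return data.reindex(columns=output_cols)


# Hypothetical input: wrong order, one missing column, one extra column.
raw = pd.DataFrame({"genre": ["Rock"], "artist": ["Example Artist"], "extra": [1]})
aligned = validate_dataframe_sketch(raw, ["artist", "album", "genre"])
```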

src.post_process module

Analyze a trained model.

src.post_process.get_feature_importance(trained_pipeline: sklearn.pipeline.Pipeline, numeric_features: List[str]) -> pandas.core.series.Series

Get feature importance measures from a trained model.

Parameters
  • trained_pipeline (sklearn.pipeline.Pipeline) – Fitted model pipeline

  • numeric_features (list(str)) – Names of numeric features

Returns

pandas.Series containing each feature and its importance

src.score_model module

Generate new values given a trained model and some new input.

src.score_model.append_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame, output_col: str = 'preds') -> pandas.core.frame.DataFrame

Append predictions to an existing input DataFrame.

Parameters
  • trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline

  • input_data (pandas.DataFrame) – Input data to predict on

  • output_col (str, optional) – Name of column to place predicted values in. Defaults to “preds”.

Returns

Input pandas.DataFrame with predictions appended as a new column

src.score_model.get_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame) -> list

Get predicted values for input data.

Parameters
  • trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline

  • input_data (pandas.DataFrame) – Input data to predict on

Returns

array-like of predicted values

src.serialize module

Serialize and deserialize trained model pipelines.

src.serialize.load_pipeline(load_path: str) -> sklearn.pipeline.Pipeline

Deserialize a fitted model pipeline.

Parameters

load_path (str) – Path to joblib-saved pipeline

Returns

Fitted sklearn.pipeline.Pipeline object

src.serialize.save_pipeline(pipeline: sklearn.pipeline.Pipeline, save_path: str) -> None

Serialize a fitted model pipeline.

Parameters
  • pipeline (sklearn.pipeline.Pipeline) – Fitted model pipeline

  • save_path (str) – Where to save the pipeline

Returns

None
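
A save/load round trip with joblib might look like this. The function names save_pipeline_sketch and load_pipeline_sketch, the file name, and the placeholder DummyRegressor pipeline are assumptions used only to keep the example small.

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline


def save_pipeline_sketch(pipeline, save_path):
    """Write a fitted pipeline to disk with joblib."""
    joblib.dump(pipeline, save_path)


def load_pipeline_sketch(load_path):
    """Read a joblib-saved pipeline back into memory."""
    return joblib.load(load_path)


# A placeholder pipeline that always predicts the mean of its training targets.
pipeline = Pipeline(steps=[("model", DummyRegressor(strategy="mean"))])
pipeline.fit([[0.0], [1.0], [2.0]], [6.0, 7.0, 8.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    save_pipeline_sketch(pipeline, path)
    restored = load_pipeline_sketch(path)

restored_prediction = float(restored.predict([[5.0]])[0])
```

Files written this way should only be loaded from trusted sources and with a compatible scikit-learn version, since joblib relies on pickling.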

Module contents