src package¶
Submodules¶
src.albums_database module¶
Create and manipulate a relational database for holding album data.
-
class
src.albums_database.AlbumManager(app=None, engine_string=None)[source]¶ Bases: object
Manages Flask <-> SQLAlchemy connection and adds data to database.
-
add_album(album: str, artist: str, reviewauthor: str, score: float, releaseyear: int, reviewdate: datetime.datetime, recordlabel: str, genre: str, danceability: float, energy: float, key: float, loudness: float, speechiness: float, acousticness: float, instrumentalness: float, liveness: float, valence: float, tempo: float) → None[source]¶ Seed an existing database with additional albums.
- Parameters
album (str) – Album title
artist (str) – Artist
reviewauthor (str) – Name of reviewing author
score (float) – Pitchfork rating
releaseyear (int) – Album release year
reviewdate (str) – Album review date (%B %d %Y)
recordlabel (str) – Album’s record label(s)
genre (str) – Album genre
danceability (float) – Spotify danceability score
energy (float) – Spotify energy score
key (float) – Spotify key score
loudness (float) – Spotify loudness score
speechiness (float) – Spotify speechiness score
acousticness (float) – Spotify acousticness score
instrumentalness (float) – Spotify instrumentalness score
liveness (float) – Spotify liveness score
valence (float) – Spotify valence score
tempo (float) – Spotify tempo score
- Returns
None
-
-
class
src.albums_database.Albums(**kwargs)[source]¶ Bases: sqlalchemy.ext.declarative.api.Base
Create a data model for the database to capture albums.
-
acousticness¶
-
album¶
-
artist¶
-
danceability¶
-
energy¶
-
genre¶
-
id¶
-
instrumentalness¶
-
key¶
-
liveness¶
-
loudness¶
-
recordlabel¶
-
releaseyear¶
-
reviewdate¶
-
score¶
-
speechiness¶
-
tempo¶
-
valence¶
-
src.clean module¶
Clean the dataset before modeling.
-
src.clean.approximate_missing_year(data: pandas.core.frame.DataFrame, fill_column: str = 'releaseyear', approximate_with: str = 'reviewdate') → pandas.core.frame.DataFrame[source]¶ Fill missing values in one column with the year of another datetime column.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
fill_column (str) – Name of column to fill values in
approximate_with (str) – Name of datetime column to pull year from
- Returns
Cleaned pandas.DataFrame
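The fill logic can be illustrated without pandas. The following is a minimal sketch, assuming rows are plain dictionaries and that a missing year is represented by None; the helper name approximate_missing_year_sketch is hypothetical and is not the project's implementation.

```python
from datetime import datetime
from typing import Any, Dict, List


def approximate_missing_year_sketch(
    rows: List[Dict[str, Any]],
    fill_column: str = "releaseyear",
    approximate_with: str = "reviewdate",
) -> List[Dict[str, Any]]:
    """Return copies of rows whose missing year is taken from a datetime field."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is not mutated
        if new_row.get(fill_column) is None:
            # Assumption: the datetime field is always present and populated
            new_row[fill_column] = new_row[approximate_with].year
        filled.append(new_row)
    return filled


albums = [
    {"album": "A", "releaseyear": 2015, "reviewdate": datetime(2016, 1, 4)},
    {"album": "B", "releaseyear": None, "reviewdate": datetime(2018, 7, 9)},
]
cleaned = approximate_missing_year_sketch(albums)
```

Only rows with a missing value are changed; an existing release year is kept even when it differs from the review year.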
-
src.clean.bucket_values_together(data: pandas.core.frame.DataFrame, colname: str, values: list, replace_with: list) → pandas.core.frame.DataFrame[source]¶ Replace one or more values with a single value.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str) – Name of column to apply transformation to
values (iterable) – Iterable of values to replace.
replace_with – Value to replace with.
- Returns
Cleaned pandas.DataFrame
- Raises
TypeError – If values is passed as a single string. Because a string in Python is simply a sequence of characters, it doesn’t immediately register as bad input, but it doesn’t logically make sense for this method.
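One plausible reading of the TypeError note is a guard against passing a bare string as values, since iterating over a string yields its individual characters. The sketch below works on a plain list instead of a DataFrame column; bucket_values_sketch is a hypothetical name used only for illustration.

```python
from typing import Iterable, List


def bucket_values_sketch(
    column: List[str], values: Iterable[str], replace_with: str
) -> List[str]:
    """Replace every item found in `values` with the single value `replace_with`."""
    if isinstance(values, str):
        # "rap" would otherwise be treated as the values "r", "a", "p"
        raise TypeError("`values` must be an iterable of values, not a single string")
    targets = set(values)
    return [replace_with if item in targets else item for item in column]


genres = ["Rap", "Hip-Hop", "Rock", "rap"]
bucketed = bucket_values_sketch(genres, ["Hip-Hop", "rap"], "Rap")
```

Wrapping a single value in a list, for example ["rap"], avoids the error.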
-
src.clean.clean_dataset(data: pandas.core.frame.DataFrame, config) → pandas.core.frame.DataFrame[source]¶ Perform full data processing pipeline.
- Parameters
data (pandas.DataFrame) – Raw data
config (dict) – Config file as read in by PyYAML
- Returns
pandas.DataFrame of cleaned data
-
src.clean.convert_datetime_to_date(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate') → pandas.core.frame.DataFrame[source]¶ Remove the time component of a datetime column.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.
- Returns
Cleaned pandas.DataFrame
-
src.clean.convert_str_to_datetime(data: pandas.core.frame.DataFrame, colname: str = 'reviewdate', datetime_format: str = '%B %d %Y') → pandas.core.frame.DataFrame[source]¶ Parse a string column to datetime format.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “reviewdate”.
datetime_format (str, optional) – Datetime format of column. Defaults to “%B %d %Y”. For more info on these codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes.
- Returns
Cleaned pandas.DataFrame
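To show what the default format string matches, here is a small standard-library sketch that parses a single value. The sample date is made up, and %B relies on English month names, which holds under Python's default C locale.

```python
from datetime import datetime

# "%B %d %Y" reads a full month name, a day of the month, and a four-digit year
review_date = datetime.strptime("March 5 2019", "%B %d %Y")

# Dropping the time component leaves a plain date
date_only = review_date.date()
```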
-
src.clean.fill_missing_manually(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel', fill_with: tuple = ("Fool's Gold", 'Vapor', '101 Distribution', 'Jet Life', 'Espo', 'Cinematic', 'Def Jam', 'LM Dupli-Cation', 'Glory Boyz', 'Epic', 'Self-released', 'Cash Money', 'Grand Hustle', 'Vice', 'Free Bandz', 'Six Shooter Records', 'Self-released', 'Maybach', 'Self-released', 'Top Dawg', 'Triple X', '1017', 'Rostrum', 'BasedWorld', '10.Deep', 'Self-released')) → pandas.core.frame.DataFrame[source]¶ Manually fill missing values.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.
fill_with (iterable) – Corrected values to replace missing values with. Data type depends on the column being filled.
- Returns
Cleaned pandas.DataFrame
-
src.clean.fill_na_with_str(data: pandas.core.frame.DataFrame, colname: str = 'genre', fill_string: str = 'Missing') → pandas.core.frame.DataFrame[source]¶ Fill NA values with a string value.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “genre”.
fill_string (str, optional) – String to replace missing values with. Defaults to “Missing”.
- Returns
Cleaned pandas.DataFrame
-
src.clean.strip_whitespace(data: pandas.core.frame.DataFrame, colname: str = 'recordlabel') → pandas.core.frame.DataFrame[source]¶ Trim extra whitespace from values in a column.
- Parameters
data (pandas.DataFrame) – DataFrame to clean
colname (str, optional) – Name of column to apply transformation to. Defaults to “recordlabel”.
- Returns
Cleaned pandas.DataFrame
src.evaluate_performance module¶
Evaluate the performance of a model through its predictions.
-
src.evaluate_performance.evaluate_model(results_data: pandas.core.frame.DataFrame, y_true_colname: str, y_pred_colname: str) → pandas.core.frame.DataFrame[source]¶ Evaluate performance against a variety of regression metrics:
MSE
RMSE
MAD
R-squared
Max error
- Parameters
results_data (pandas.DataFrame) – DataFrame containing (at least) predicted and ground truth values
y_true_colname (str) – Name of column containing true values
y_pred_colname (str) – Name of column containing predicted values
- Returns
pandas.DataFrame containing metrics and values
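The listed metrics can be computed by hand for a small example. This is a sketch under stated assumptions: it uses plain sequences rather than DataFrame columns, it interprets MAD as the mean absolute deviation of the errors (the documentation does not define the abbreviation), and regression_metrics_sketch is a hypothetical name.

```python
import math
from typing import Dict, Sequence


def regression_metrics_sketch(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute common regression metrics from true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAD": sum(abs(e) for e in errors) / n,  # assumption: mean absolute deviation
        "R-squared": 1 - ss_res / ss_tot,
        "Max error": max(abs(e) for e in errors),
    }


metrics = regression_metrics_sketch([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```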
src.load_data module¶
Move data between a local filesystem and S3 bucket.
Copyright 2020, Chloe Mawer
-
src.load_data.download_file_from_s3(local_path: str, s3path: str) → None[source]¶ Download a file from S3.
- Parameters
local_path (str) – Destination file or path on local machine
s3path (str) – File or path to download from S3
- Returns
None
-
src.load_data.download_from_s3_pandas(local_path: str, s3path: str, sep: str = ',') → None[source]¶ Download a pandas.DataFrame from S3.
- Parameters
local_path (str) – Destination file or path on local machine
s3path (str) – File or path to download from S3
sep (str, optional) – Field separator in S3 file. Defaults to “,”.
- Returns
None
-
src.load_data.download_raw_data(local_destination: str) → None[source]¶ Download the original dataset from source.
- Parameters
local_destination (str) – Destination file or path on local machine
- Returns
None
-
src.load_data.parse_s3(s3path: str) → Tuple[str, str][source]¶ Split an S3 filepath into the bucket name and subsequent path.
- Parameters
s3path (str) – File path in S3 (including “s3://” prefix)
- Returns
Tuple containing S3 bucket name and S3 path
- Return type
tuple(str, str)
- Raises
ValueError – If s3path is not of the form “s3://bucket/path”
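Splitting an S3 path that carries the “s3://” prefix could look like the sketch below. The regular expression and the name parse_s3_sketch are assumptions for illustration, not the module's actual code, and the bucket name in the usage line is made up.

```python
import re
from typing import Tuple


def parse_s3_sketch(s3path: str) -> Tuple[str, str]:
    """Split "s3://bucket/path" into ("bucket", "path")."""
    match = re.match(r"^s3://([^/]+)/(.+)$", s3path)
    if match is None:
        raise ValueError(f"Expected a path like s3://bucket/path, got {s3path!r}")
    return match.group(1), match.group(2)


bucket, key = parse_s3_sketch("s3://my-bucket/data/raw.csv")
```

Everything after the first slash that follows the bucket name is returned as the path, including any nested folders.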
-
src.load_data.upload_file_to_s3(local_path: str, s3path: str) → None[source]¶ Upload a local file to S3.
- Parameters
local_path (str) – File name or path to local file to upload.
s3path (str) – Destination path in S3.
- Returns
None
-
src.load_data.upload_to_s3_pandas(local_path: str, s3path: str, sep: str = ',') → None[source]¶ Upload a pandas.DataFrame to S3.
- Parameters
local_path (str) – File name or path to local file to upload.
s3path (str) – Destination path in S3.
sep (str, optional) – Field separator. Defaults to “,”.
- Returns
None
src.model module¶
Build, fit, and evaluate predictive models.
-
src.model.make_model(**kwargs) → sklearn.base.BaseEstimator[source]¶ Create an untrained GBT model for use in a sklearn.pipeline.Pipeline.
- Parameters
**kwargs – Parameters to pass on to GBT constructor
- Returns
Untrained sklearn.ensemble.GradientBoostingRegressor object
-
src.model.make_preprocessor(numeric_features: List[str], categorical_features: List[str], handle_unknown: str) → sklearn.compose._column_transformer.ColumnTransformer[source]¶ Define preprocessing steps for input features.
Performs standard scaling for numeric features and one-hot encoding for categorical features. All features specified for this function are processed, and _only_ these features are used when modeling. In other words, this preprocessor determines the exact input columns (and order) when training and performing inference.
- Parameters
numeric_features (list(str)) – Names of numeric features to scale
categorical_features (list(str)) – Names of categorical features to one-hot encode
handle_unknown (str) – Policy for unknown categories in OneHotEncoder (either “ignore” or “error”)
- Returns
A sklearn.compose.ColumnTransformer with the desired transformation steps
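scikit-learn's OneHotEncoder accepts “error” and “ignore” for its handle_unknown policy. The pure-Python sketch below mimics the effect of each policy on a single value; it is an illustration of the concept only, and one_hot_sketch is a hypothetical helper, not part of scikit-learn or this package.

```python
from typing import List


def one_hot_sketch(
    value: str, categories: List[str], handle_unknown: str = "error"
) -> List[int]:
    """One-hot encode `value` against the categories seen during training."""
    if value not in categories:
        if handle_unknown == "error":
            raise ValueError(f"Unknown category: {value!r}")
        if handle_unknown == "ignore":
            # An unseen category becomes an all-zero row
            return [0] * len(categories)
        raise ValueError(f"Unsupported policy: {handle_unknown!r}")
    return [1 if value == category else 0 for category in categories]


genres_seen = ["Electronic", "Rap", "Rock"]
known = one_hot_sketch("Rap", genres_seen)
unknown = one_hot_sketch("Jazz", genres_seen, handle_unknown="ignore")
```

With “ignore”, inference on a genre that was absent from training still produces a valid feature row instead of failing.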
-
src.model.parse_dict_to_dataframe(form_dict: dict) → pandas.core.frame.DataFrame[source]¶ Parse a dictionary to pandas.DataFrame format.
Flask forms supply data via POST requests in MultiDict format, but the model pipeline requires an input DataFrame.
- Parameters
form_dict (dict) – Flask form response as a flat dictionary
- Returns
pandas.DataFrame with keys as column names and values as the associated values for each key
-
src.model.split_predictors_response(data: pandas.core.frame.DataFrame, target_col: str = 'score') → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶ Separate predictor variables from response variable.
-
src.model.split_train_val_test(features: pandas.core.frame.DataFrame, target: list, train_val_test_ratio: str, **kwargs) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶ Partition dataset into training, validation, and testing splits.
- Parameters
features (pandas.DataFrame) – DataFrame of input features
target (array-like) – Values of response variable to predict
train_val_test_ratio (str) – Relative proportion of data for each of train, val, and test sets, in the form “X:Y:Z” (e.g., “6:2:2”).
**kwargs – Additional settings to pass on to train_test_split() (for example, random seed)
- Returns
(X_train, X_val, X_test, y_train, y_val, y_test), each as DataFrames. X_val and y_val are omitted if the desired ratio does not specify the size of a validation set.
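A ratio string such as “6:2:2” has to be turned into proportions before any splitting can happen. The sketch below shows one way to do that step; parse_ratio_sketch is a hypothetical helper, and how the real function handles a ratio without a validation part is not covered here.

```python
from typing import List


def parse_ratio_sketch(train_val_test_ratio: str) -> List[float]:
    """Convert "X:Y:Z" into fractions that sum to 1."""
    parts = [float(part) for part in train_val_test_ratio.split(":")]
    total = sum(parts)
    if total <= 0 or any(part < 0 for part in parts):
        raise ValueError(f"Invalid ratio: {train_val_test_ratio!r}")
    return [part / total for part in parts]


fractions = parse_ratio_sketch("6:2:2")
```

For “6:2:2” the fractions are approximately 0.6, 0.2, and 0.2 for the train, validation, and test sets.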
-
src.model.train_pipeline(X_train: pandas.core.frame.DataFrame, y_train: list, preprocessor: sklearn.compose._column_transformer.ColumnTransformer, model: sklearn.base.BaseEstimator) → sklearn.pipeline.Pipeline[source]¶ Create and fit a preprocessing --> modeling pipeline.
- Parameters
X_train (pandas.DataFrame) – Training features
y_train (array-like) – Training targets
preprocessor (sklearn.compose.ColumnTransformer) – ColumnTransformer defining the processing to perform for input data
model (sklearn.base.BaseEstimator) – An untrained sklearn regression model
- Returns
A fitted sklearn.pipeline.Pipeline
-
src.model.validate_dataframe(data: pandas.core.frame.DataFrame, output_cols: List = ['artist', 'album', 'reviewauthor', 'releaseyear', 'reviewdate', 'recordlabel', 'genre', 'danceability', 'energy', 'key', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']) → pandas.core.frame.DataFrame[source]¶ Align a DataFrame with model pipeline’s required order and names.
The model pipeline requires an input DataFrame with exactly the same columns as seen during training, and in the same order. Creates the columns that don’t exist (filling with NA).
- Parameters
data (pandas.DataFrame) – Input DataFrame to validate/align
output_cols (list(str), optional) – Required columns for output DataFrame. Defaults to those seen during training. If not provided (None), no adjustment to the DataFrame’s columns is made.
- Returns
Validated pandas.DataFrame
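The alignment idea can be shown on a single record. This sketch assumes a row is a plain dictionary, that missing columns are filled with None, and that columns outside the required list are dropped; the last point is an assumption, since the description only states that the output must match the training columns and order. align_record_sketch is a hypothetical name.

```python
from typing import Any, Dict, List


def align_record_sketch(record: Dict[str, Any], output_cols: List[str]) -> Dict[str, Any]:
    """Return the record with exactly `output_cols`, in that order."""
    # dict.get returns None for columns the record does not contain
    return {col: record.get(col) for col in output_cols}


required = ["artist", "album", "genre"]
aligned = align_record_sketch({"genre": "Rock", "artist": "X", "extra": 1}, required)
```

With pandas, a comparable result for a whole DataFrame can be obtained with DataFrame.reindex(columns=...), which also fills missing columns with NA.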
src.post_process module¶
Analyze a trained model.
-
src.post_process.get_feature_importance(trained_pipeline: sklearn.pipeline.Pipeline, numeric_features: List[str]) → pandas.core.series.Series[source]¶ Get feature importance measures from a trained model.
- Parameters
trained_pipeline (sklearn.pipeline.Pipeline) – Fitted model pipeline
numeric_features (list(str)) – Names of numeric features
- Returns
pandas.Series containing each feature and its importance
src.score_model module¶
Generate new values given a trained model and some new input.
-
src.score_model.append_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame, output_col: str = 'preds') → pandas.core.frame.DataFrame[source]¶ Append predictions to an existing input DataFrame.
- Parameters
trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline
input_data (pandas.DataFrame) – Input data to predict on
output_col (str, optional) – Name of column to place predicted values in. Defaults to “preds”.
- Returns
Input pandas.DataFrame with predictions appended as a new column
-
src.score_model.get_predictions(trained_model: sklearn.pipeline.Pipeline, input_data: pandas.core.frame.DataFrame) → list[source]¶ Get predicted values for input data.
- Parameters
trained_model (sklearn.pipeline.Pipeline) – Trained model pipeline
input_data (pandas.DataFrame) – Input data to predict on
- Returns
array-like of predicted values
src.serialize module¶
Serialize and deserialize trained model pipelines.