Data Reception and Understanding

Data Source Analysis πŸ“Š

Understanding the variety of input formats is crucial for automation success. This chapter details our approach to handling diverse data sources.

Format Distribution πŸ“ˆ

Format Frequency Challenges
Excel 97-2003 30 Legacy format, limited features
Excel 2007+ 45 Multiple sheets, complex formatting
CSV 25 Encoding issues, delimiter variations

Automated Reception System πŸ€–

System Architecture πŸ—οΈ

flowchart LR
    subgraph Raw Data
        RD[Raw Data Files]
    end

    subgraph Database
        PG[(DuckDB Database)]
        CT[(Control Table)]
    end

    subgraph Data Transformation
        R[R Scripts]
    end

    subgraph Layers
        BZ[Bronze Layer]
        SL[Silver Layer]
        GL[Gold Layer]
    end

    subgraph Analysis & Visualization
        RS[RStudio]
        SH[Shiny Dashboard]
        MT[Monitoring Dashboard]
    end

    %% Data Ingestion and Logging
    RD -->|Load & Log| CT
    RD -->|Load| BZ

    %% Data Transformation with R
    BZ -->|Process with R| SL
    SL -->|Process with R| GL

    %% Status Updates to Control Table
    BZ & SL & GL -->|Status Updates| CT

    %% Storing Processed Data
    GL -->|Store| PG

    %% Analysis and Visualization
    PG -->|Query Data| RS
    PG -->|Query Data| SH
    RS -->|Analysis Results| SH

    %% Monitoring and Reporting
    CT -->|Processing Status| MT
    CT -->|Pipeline Metrics| SH

    %% Styling Nodes
    style CT fill:#f9f,stroke:#333,stroke-width:4px
    style BZ fill:#ffa07a
    style SL fill:#98fb98
    style GL fill:#87cefa

Explanation of the Diagram

1. Raw Data Ingestion

  • RD[Raw Data Files]: Represents the raw data files received from various sources.
  • CT[(Control Table)]: A table in DuckDB used to log and track the status of data processing.
  • Process:
    • Raw data files (RD) are loaded into the Bronze Layer (BZ).
    • Each file loading is logged in the Control Table (CT) with metadata such as file name, load time, and initial status.

2. Data Transformation with R

  • R[R Scripts]: R scripts are responsible for data transformation between layers.
  • Layers:
    • BZ[Bronze Layer]: Contains raw data as it was received, with minimal processing.
    • SL[Silver Layer]: Data is cleaned and validated here.
    • GL[Gold Layer]: Data is transformed into the final, analytics-ready format.
  • Process:
    • BZ –> SL: The R scripts read data from the Bronze Layer, perform cleaning and validation, and write the processed data to the Silver Layer.
    • SL –> GL: The R scripts further transform the data from the Silver Layer to the Gold Layer, applying business logic and aggregations.

3. Status Updates to the Control Table

  • Each transformation step updates the Control Table (CT) with the current processing status, step name, any error messages, and timestamps.
  • This allows for real-time monitoring of the data pipeline.

4. Storing Processed Data

  • GL[Gold Layer] –> PG[(DuckDB Database)]: The final transformed data from the Gold Layer is stored in the DuckDB database (PG).

5. Analysis & Visualization

  • PG[(DuckDB Database)]: Acts as the source of truth for analytical queries.
  • RS[RStudio]: Analysts use RStudio to query the database and perform data analysis.
  • SH[Shiny Dashboard]: Interactive dashboards are built using Shiny to visualize data and share insights.
  • Process:
    • PG –> RS: Data scientists and analysts query the data using RStudio for in-depth analysis.
    • RS –> SH: Results and analyses from RStudio are fed into Shiny dashboards for visualization.
    • PG –> SH: Shiny dashboards can also query the database directly for real-time data visualization.

6. Monitoring & Reporting

  • CT[(Control Table)] –> MT[Monitoring Dashboard]: The control table provides processing status updates to the monitoring dashboard, enabling oversight of the pipeline’s health.
  • CT –> SH: Pipeline metrics and processing statistics are available in the Shiny dashboard for reporting purposes.

7. Styling Notes

  • Control Table (CT) is highlighted with a magenta fill and a bold border to emphasize its central role in monitoring.
  • Bronze (BZ), Silver (SL), and Gold (GL) layers are color-coded for clarity:
    • Bronze Layer: Light salmon color.
    • Silver Layer: Pale green color.
    • Gold Layer: Light blue color.

Key Takeaways

  • Entirely R-Based Pipeline: All data processing and transformations are performed using R scripts without the use of dbt.
  • Control Table Integration: The control table is central to tracking the status and progress of data through each layer.
  • Layered Architecture:
    • Bronze Layer: Raw data ingestion.
    • Silver Layer: Data cleaning and validation.
    • Gold Layer: Final data transformations for analysis.
  • Database Storage: The DuckDB database stores the processed data, making it accessible for querying and analysis.
  • Analysis Tools:
    • RStudio: Used by analysts for data exploration and modeling.
    • Shiny Dashboard: Provides interactive visualizations and shares insights with stakeholders.
  • Monitoring: The monitoring dashboard receives real-time updates on the pipeline’s processing status from the control table.

Next Steps πŸ“‹

  • Implement the Control Table: Ensure the control table is created in your DuckDB database and integrated into your R scripts.
  • Develop R Scripts for Each Layer:
    • Bronze to Silver: Write scripts to clean and validate data.
    • Silver to Gold: Write scripts to apply business logic and prepare data for analysis.
  • Set Up Analysis Environment:
    • Configure RStudio connections to your DuckDB database.
    • Develop Shiny dashboards to visualize data and pipeline metrics.
  • Monitoring:
    • Create the monitoring dashboard to display processing statuses and alert on failures or bottlenecks.
  • ➑️ Proceed to Data Preparation
  • πŸ”™ Back to Introduction
  • πŸ“š View Technical Documentation