Data Lake - Step-by-Step SQL Scripts

**Step By Step Whole Process**

Step by step, this is what needs to be done to load data into Snowflake. If a step is already completed, you can skip it.

1. **Create a Data Warehouse:**

To create a data warehouse, use the `CREATE WAREHOUSE` command. Replace `<your_warehouse_name>` with your desired warehouse name.

```
CREATE WAREHOUSE <your_warehouse_name>
  WAREHOUSE_SIZE = 'X-SMALL'  -- Specify the warehouse size
  AUTO_SUSPEND = 600          -- Specify auto-suspend time in seconds
  AUTO_RESUME = TRUE;         -- Enable auto-resume
```

2. **Create a Database:**

To create a database within the data warehouse, use the `CREATE DATABASE` command. Replace `<your_database_name>` with your desired database name.

```
CREATE DATABASE <your_database_name>;
```

3. **Create a Schema:**

After creating a database, you can create one or more schemas within it using the `CREATE SCHEMA` command. A schema is a way to organize tables within a database. Replace `<your_schema_name>` with your desired schema name.

```
CREATE SCHEMA <your_schema_name>;
```

4. **Create Tables:**

To create tables within a schema, use the `CREATE TABLE` command. Replace `<your_table_name>` with your desired table name, and define the columns and data types as needed.

```
CREATE TABLE <your_schema_name>.<your_table_name> (
  column1_name data_type,
  column2_name data_type
);
```

Below are some of the basic data types supported by Snowflake. Snowflake allows you to create tables with columns of different data types to suit your specific data modeling needs. `VARCHAR(16777216)` can be used as the data type for every column, but we recommend selecting the specific data type for each column. For advanced data types, such as timestamps with time zones and user-defined types, please refer to the Snowflake documentation and research how to best define these fields for your needs.

**Numeric Types:**
- `NUMBER` or `NUMERIC`: Fixed-point numbers with user-defined precision and scale.
- `INTEGER`: Whole numbers (integers) without decimal places.
- `FLOAT` or `DOUBLE`: Floating-point numbers.

**Character String Types:**
- `VARCHAR`: Variable-length character strings.
- `CHAR`: Fixed-length character strings.
- `STRING`: An alias for `VARCHAR`, commonly used for compatibility.

**Binary Types:**
- `BINARY`: Binary data or byte strings.

**Date and Time Types:**
- `DATE`: A date without a time component (e.g., '2023-06-21').
- `TIME`: A time of day without a date component (e.g., '14:30:00').
- `TIMESTAMP`: A date and time with fractional seconds (e.g., '2023-06-21 14:30:00.123').
- `TIMESTAMP_TZ`: A timestamp with an associated time zone.

**Boolean Type:**
- `BOOLEAN`: A boolean value (either `TRUE` or `FALSE`).

**Semi-Structured Types:**
- `VARIANT`: Semi-structured data, such as JSON or XML.
- `ARRAY`: An ordered collection of values.
- `OBJECT`: A collection of key-value pairs, similar to JSON.

**Geospatial Types:**
- `GEOGRAPHY` and `GEOMETRY`: Geospatial objects (e.g., points, polygons).

Note that UUIDs do not have a dedicated type in Snowflake and are typically stored as `VARCHAR`.
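
For illustration, here is a minimal sketch of a concrete table definition that combines several of the types above. The schema, table, and column names (`MY_SCHEMA.MEMBER_TRANSACTIONS`, etc.) are hypothetical placeholders, not part of any standard model.

```
-- Hypothetical example table combining several common data types
CREATE TABLE MY_SCHEMA.MEMBER_TRANSACTIONS (
  transaction_id   NUMBER(38,0),   -- whole-number identifier
  member_name      VARCHAR(100),   -- variable-length string
  amount           NUMBER(12,2),   -- fixed-point number with 2 decimal places
  transaction_date DATE,           -- date only
  posted_at        TIMESTAMP,      -- date and time
  is_reversed      BOOLEAN,        -- TRUE / FALSE flag
  raw_payload      VARIANT         -- semi-structured JSON, if needed
);
```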

5. **Create File Formats:**

File formats define how Snowflake should interpret data files during loading. Use the `CREATE FILE FORMAT` command to define a file format. Replace `<your_format_name>` with your desired format name.

```
CREATE OR REPLACE FILE FORMAT <your_format_name>
  TYPE = 'CSV'                          -- File format type: { CSV | JSON | AVRO | ORC | PARQUET | XML }
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE   -- Character that encloses string fields, or NONE if not needed
  SKIP_HEADER = 1                       -- Number of header lines to skip
  SKIP_BLANK_LINES = TRUE;
```

`FIELD_OPTIONALLY_ENCLOSED_BY` is used for text-based file formats like CSV. It specifies the character (for example, `"`) that encloses string fields when they contain special characters; it can be set to `NONE` if not needed.

6. **Create a Stage:**

In Snowflake, a stage defines a location where you can load data into, or unload data from, Snowflake. There are two types of stages: internal and external. Internal stages are managed by Snowflake, while external stages are hosted on cloud storage platforms like Amazon S3, Azure Blob Storage, or Google Cloud Storage.

Internal stages are best used for manual uploads from a local machine. Methods for uploading data from on-premise locations are:

1) Manual upload through the Snowsight UI.
2) Download the SnowSQL client and connect it to your on-premise storage locations. SnowSQL is a command-line application that can be downloaded from the Snowflake documentation; instructions for connecting are also in the Snowflake documentation. (A sketch of this route is shown after this step.)
3) Set up external SFTP file locations for Snowflake to connect to and pull files from the server.
4) Invest in a tool through Snowflake Partners (Fivetran, Azure Data Factory, etc.). Trellance can provide Azure Data Factory licensing as a Microsoft MSP, or you may take advantage of Snowflake Partner Connect to utilize Snowflake partnerships.

External stages are best for SFTP loads and for loading from the cloud platforms mentioned above. Both internal and external stages can be created with the commands below or through the Snowsight UI.

Internal stage:

```
-- your_internal_stage: Replace with your desired stage name
CREATE OR REPLACE STAGE your_internal_stage;

-- Optionally, specify the storage integration if you're using one
ALTER STAGE your_internal_stage SET STORAGE_INTEGRATION = your_integration;
```

External stage:

```
-- Create an external stage for SFTP
CREATE OR REPLACE STAGE sftp_stage
  URL = 'sftp://your-sftp-server'
  CREDENTIALS = (
    USERNAME = 'your-sftp-username',
    PASSWORD = 'your-sftp-password'
  );
```

For other external stages, such as cloud storage, please refer to the Snowflake documentation.
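
As a sketch of the SnowSQL route mentioned above: once an internal stage exists, files can be staged from a local machine with the `PUT` command (typically run from the SnowSQL client or a driver rather than a worksheet) and then verified with `LIST`. The local file path below is a hypothetical placeholder; the stage name matches the example above.

```
-- Run from SnowSQL: upload a local file into the internal stage (path is a placeholder)
PUT file:///tmp/members.csv @your_internal_stage AUTO_COMPRESS = TRUE;

-- Confirm the file landed in the stage
LIST @your_internal_stage;
```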

7. **Load Data Into Tables**

7a. **Manual or One-Time Upload**

The Snowsight UI can be used to load data into tables, or use the `COPY INTO` command:

```
COPY INTO your_target_table
  FROM @your_stage/data_folder/
  FILE_FORMAT = (FORMAT_NAME = your_file_format);
```

Here's a breakdown of each part of the command:

- `your_target_table`: Replace this with the name of the table where you want to load the data.
- `@your_stage/data_folder/`: Specify the stage where your data is located. `your_stage` is the name of your stage, and `data_folder/` is the path within the stage where your data files are stored (used if uploading files from a local machine).
- `FILE_FORMAT = (FORMAT_NAME = your_file_format)`: If you have a specific file format defined for your data files in the stage, you can specify it here. This is optional but helps Snowflake correctly interpret the data in the files.

Ensure that:

1. Your stage contains the data files you want to load into the table.
2. The structure of the data files matches the structure of the target table in terms of columns and data types.
3. You have the necessary permissions to execute the `COPY INTO` command and insert data into the target table.
4. If you're using an external stage, you've configured the appropriate storage integration and credentials to access the stage.

When you run the `COPY INTO` command, Snowflake copies the data from the specified stage into the target table.

7b. **Set Up Automated Data Upload**

There are two ways to automate data upload: Snowpipe and Tasks. In summary, Snowflake Tasks offer greater flexibility and are suitable for scenarios where you need custom data processing, scheduling, and more complex operations; Tasks are more suitable for SFTP sources and stages. Snowpipe, on the other hand, is best suited for simplified, event-driven data loading with low latency and is best used for ingesting data from cloud platforms. The choice between them depends on your specific use case and requirements. It's also possible to use both in combination to cover a broader range of data integration needs.

*Using Tasks to automate data upload*

```
CREATE OR REPLACE TASK your_upload_task
  WAREHOUSE = your_warehouse
  SCHEDULE = 'YOUR_SCHEDULE'  -- Define the schedule (e.g., '5 MINUTE', '60 MINUTE', or 'USING CRON 0 2 * * * UTC')
AS
  -- Use the COPY INTO command to load data from the stage into the table
  COPY INTO your_target_table
    FROM @your_stage/  -- your internal stage or SFTP external stage
    FILE_FORMAT = (FORMAT_NAME = your_file_format);
```

Enable the task and start its schedule using the `ALTER TASK` command:

```
ALTER TASK your_upload_task RESUME;
```

*Using Snowpipe*

Once the data is in the cloud storage platform, create a storage integration and an external stage to access the data.

```
-- Create a storage integration for Snowpipe
CREATE OR REPLACE STORAGE INTEGRATION your_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = <storage_provider>  -- Replace with the appropriate cloud storage provider
  STORAGE_ALLOWED_LOCATIONS = ('<cloud_storage_url>')
  ENABLED = TRUE;

-- Create an external stage that uses the integration
CREATE OR REPLACE STAGE your_external_stage
  URL = '<cloud_storage_url>'
  STORAGE_INTEGRATION = your_integration;
```

- `<storage_provider>`: Replace with your cloud storage provider (e.g., 'S3', 'AZURE', or 'GCS').
- `<cloud_storage_url>`: Replace with the URL of your cloud storage location where the data was transferred.

Create a Snowpipe pipe that specifies the external stage, file format, and the target table in Snowflake where data should be loaded.

```
-- Create a Snowpipe pipe
CREATE OR REPLACE PIPE your_pipe
  AUTO_INGEST = TRUE  -- Set to TRUE to enable automatic data ingestion
  INTEGRATION = your_integration
AS
  COPY INTO your_target_table    -- Replace with the name of your target table
    FROM @your_external_stage/   -- Replace with the path to your external stage
    FILE_FORMAT = (FORMAT_NAME = your_file_format);  -- Optional: specify the file format
```
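
Once the automation is in place, it is worth verifying that it is actually running. The statements below are a small sketch assuming the object names used above (`your_upload_task`, `your_pipe`); adjust them to your own names.

```
-- Review recent executions of the task
-- (unquoted identifiers are stored uppercase, hence 'YOUR_UPLOAD_TASK')
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'YOUR_UPLOAD_TASK'))
ORDER BY SCHEDULED_TIME DESC;

-- Check the current status of the Snowpipe pipe
SELECT SYSTEM$PIPE_STATUS('your_pipe');

-- Queue any files already sitting in the stage for ingestion by the pipe
ALTER PIPE your_pipe REFRESH;
```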

 
