Type
Resource
Tag
aws_glue_crawler
AWS Service
Glue
Description
Used to create a Glue crawler.
Examples
Data stored in DynamoDB
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn dynamodb_target { path = "table-name" } }
Data stored in a database, with metadata crawled via JDBC
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn jdbc_target { connection_name = aws_glue_connection.example.name path = "database-name/%" } }
Data stored in S3
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn s3_target { path = "s3://${aws_s3_bucket.example.bucket}" } }
Data stored in the Glue Data Catalog
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn catalog_target { database_name = aws_glue_catalog_database.example.name tables = [aws_glue_catalog_table.example.name] } schema_change_policy { delete_behavior = "LOG" } configuration = <<EOF { "Version":1.0, "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" } } EOF }
Data stored in MongoDB
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn mongodb_target { connection_name = aws_glue_connection.example.name path = "database-name/%" } }
Example of crawler configuration settings
resource "aws_glue_crawler" "events_crawler" { database_name = aws_glue_catalog_database.glue_database.name schedule = "cron(0 1 * * ? *)" name = "events_crawler_${var.environment_name}" role = aws_iam_role.glue_role.arn tags = var.tags configuration = jsonencode( { Grouping = { TableGroupingPolicy = "CombineCompatibleSchemas" } CrawlerOutput = { Partitions = { AddOrUpdateBehavior = "InheritFromTable" } } Version = 1 } ) s3_target { path = "s3://${aws_s3_bucket.data_lake_bucket.bucket}" } }
Arguments
Note: Must specify at least one of dynamodb_target, jdbc_target, s3_target, mongodb_target, or catalog_target.
- database_name (Required) Glue database where results are written.
- name (Required) Name of the crawler.
- role (Required) The IAM role friendly name (including path without leading slash), or ARN of an IAM role, used by the crawler to access other resources.
- classifiers (Optional) List of custom classifiers. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification (see the sketch after this list).
- configuration (Optional) JSON string of configuration information. For more details see Setting Crawler Configuration Options.
- description (Optional) Description of the crawler.
- dynamodb_target (Optional) List of nested DynamoDB target arguments. See Dynamodb Target below.
- jdbc_target (Optional) List of nested JDBC target arguments. See JDBC Target below.
- s3_target (Optional) List of nested Amazon S3 target arguments. See S3 Target below.
- mongodb_target (Optional) List of nested MongoDB target arguments. See MongoDB Target below.
- schedule (Optional) A cron expression used to specify the schedule. For more information, see Time-Based Schedules for Jobs and Crawlers. For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
- schema_change_policy (Optional) Policy for the crawler's update and deletion behavior. See Schema Change Policy below.
- lineage_configuration (Optional) Specifies data lineage configuration settings for the crawler. See Lineage Configuration below.
- recrawl_policy (Optional) A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run. See Recrawl Policy below.
- security_configuration (Optional) The name of the Security Configuration to be used by the crawler.
- table_prefix (Optional) The table prefix used for catalog tables that are created.
- tags - (Optional) Key-value map of resource tags. If configured with a provider default_tags configuration block present, tags with matching keys will overwrite those defined at the provider-level.
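Where custom classifiers or a table prefix come into play, the two arguments compose as in the following minimal sketch. It is not taken from the examples above: the CSV classifier, its settings, and all resource names are assumptions for illustration.
resource "aws_glue_classifier" "example_csv" {
  name = "example-csv" # hypothetical classifier name

  csv_classifier {
    contains_header = "PRESENT"
    delimiter       = ","
    quote_symbol    = "\""
  }
}

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  # Custom classifiers override the built-in ones for their classification.
  classifiers  = [aws_glue_classifier.example_csv.name]
  table_prefix = "raw_" # catalog tables are created as raw_<table-name>

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }
}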
Dynamodb Target
- path - (Required) The name of the DynamoDB table to crawl.
- scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Defaults to true.
- scan_rate - (Optional) The percentage of the configured read capacity units to use by the AWS Glue crawler. The valid values are null or a value between 0.1 and 1.5.
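A minimal sketch of a throttled, sampling DynamoDB crawl; the table name and the chosen values are assumptions:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  dynamodb_target {
    path      = "table-name"
    scan_all  = false # sample rows instead of reading the full table
    scan_rate = 0.5   # use at most 50% of the configured read capacity units
  }
}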
JDBC Target
- connection_name - (Required) The name of the connection to use to connect to the JDBC target.
- path - (Required) The path of the JDBC target.
- exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
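As a sketch, exclusions on a JDBC target might look like the following; the exclusion pattern is an assumption, not a value from this page:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  jdbc_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/%"
    exclusions      = ["database-name/temp_%"] # hypothetical pattern: skip temp tables
  }
}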
S3 Target
- path - (Required) The path to the Amazon S3 target.
- connection_name - (Optional) The name of a connection which allows the crawler to access data in S3 within a VPC.
- exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
- sample_size - (Optional) Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.
- event_queue_arn - (Optional) The ARN of the SQS queue to receive S3 notifications from.
- dlq_event_queue_arn - (Optional) The ARN of the dead-letter SQS queue.
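A hedged sketch combining these S3 target options; the glob patterns and sample size are assumptions chosen for illustration:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path        = "s3://${aws_s3_bucket.example.bucket}"
    exclusions  = ["**.gz", "tmp/**"] # hypothetical globs: skip gzip files and the tmp/ prefix
    sample_size = 10                  # crawl at most 10 files per leaf folder
  }
}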
Catalog Target
- database_name - (Required) The name of the Glue database to be synchronized.
- tables - (Required) A list of catalog tables to be synchronized.
Note: The delete_behavior of a catalog target doesn't support DEPRECATE_IN_DATABASE.
Note: The configuration for catalog target crawlers will have { ... "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas"} } by default.
MongoDB Target
- connection_name - (Required) The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.
- path - (Required) The path of the Amazon DocumentDB or MongoDB target (database/collection).
- scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Default value is true.
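A sketch of sampling a DocumentDB/MongoDB collection rather than scanning it fully; the database/collection path is an assumption:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  mongodb_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/collection-name" # hypothetical database/collection
    scan_all        = false                           # sample rows rather than scanning everything
  }
}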
Schema Change Policy
- delete_behavior - (Optional) The deletion behavior when the crawler finds a deleted object. Valid values: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE. Defaults to DEPRECATE_IN_DATABASE.
- update_behavior - (Optional) The update behavior when the crawler finds a changed schema. Valid values: LOG or UPDATE_IN_DATABASE. Defaults to UPDATE_IN_DATABASE.
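A minimal sketch of an explicit schema change policy; the values are chosen for illustration:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  schema_change_policy {
    delete_behavior = "LOG"                # only log objects the crawler finds deleted
    update_behavior = "UPDATE_IN_DATABASE" # apply schema changes to the catalog
  }
}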
Lineage Configuration
- crawler_lineage_settings - (Optional) Specifies whether data lineage is enabled for the crawler. Valid values are: ENABLE and DISABLE. Default value is DISABLE.
Recrawl Policy
- recrawl_behavior - (Optional) Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run. Valid Values are: CRAWL_EVERYTHING and CRAWL_NEW_FOLDERS_ONLY. Default value is CRAWL_EVERYTHING.
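A sketch enabling lineage and incremental recrawl together. Note the schema_change_policy below: as an AWS-side constraint not stated above, CRAWL_NEW_FOLDERS_ONLY is expected to require LOG for both behaviors.
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  lineage_configuration {
    crawler_lineage_settings = "ENABLE" # default is DISABLE
  }

  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY" # only folders added since the last run
  }

  schema_change_policy {
    # CRAWL_NEW_FOLDERS_ONLY requires LOG behaviors (AWS constraint, assumed here)
    delete_behavior = "LOG"
    update_behavior = "LOG"
  }
}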
Attribute Reference
- id - Crawler name.
- arn - The ARN of the crawler.
- tags_all - A map of tags assigned to the resource, including those inherited from the provider default_tags configuration block.
Import
Replace the ${crawler_job} parameter in the command below and run it to import the Glue Crawler:
$ terraform import aws_glue_crawler.${crawler_job} ${crawler_job}
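For example, importing a crawler named events_crawler (a hypothetical name) would look like:
$ terraform import aws_glue_crawler.events_crawler events_crawler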