Glue:Resource:aws_glue_crawler

Type

Resource

Tag

aws_glue_crawler

AWS Service

Glue

Description

Primarily used to create a Glue crawler.

Examples

Data stored in DynamoDB

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  dynamodb_target {
    path = "table-name"
  }
}

Data stored in a database, with metadata crawled via JDBC

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  jdbc_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/%"
  }
}

Data stored in S3

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }
}

Data stored in the Glue Data Catalog

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  catalog_target {
    database_name = aws_glue_catalog_database.example.name
    tables        = [aws_glue_catalog_table.example.name]
  }

  schema_change_policy {
    delete_behavior = "LOG"
  }

  configuration = <<EOF
{
  "Version":1.0,
  "Grouping": {
    "TableGroupingPolicy": "CombineCompatibleSchemas"
  }
}
EOF
}

Data stored in MongoDB

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  mongodb_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/%"
  }
}

Example crawler configuration settings

resource "aws_glue_crawler" "events_crawler" {
  database_name = aws_glue_catalog_database.glue_database.name
  schedule      = "cron(0 1 * * ? *)"
  name          = "events_crawler_${var.environment_name}"
  role          = aws_iam_role.glue_role.arn
  tags          = var.tags

  configuration = jsonencode(
    {
      Grouping = {
        TableGroupingPolicy = "CombineCompatibleSchemas"
      }
      CrawlerOutput = {
        Partitions = { AddOrUpdateBehavior = "InheritFromTable" }
      }
      Version = 1
    }
  )

  s3_target {
    path = "s3://${aws_s3_bucket.data_lake_bucket.bucket}"
  }
}

Arguments

Note: At least one of dynamodb_target, jdbc_target, s3_target, mongodb_target, or catalog_target must be specified.

  • database_name (Required) Glue database where results are written.
  • name (Required) Name of the crawler.
  • role (Required) The IAM role friendly name (including path without leading slash), or ARN of an IAM role, used by the crawler to access other resources.
  • classifiers (Optional) List of custom classifiers. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification.
  • configuration (Optional) JSON string of configuration information. For more details see Setting Crawler Configuration Options.
  • description (Optional) Description of the crawler.
  • dynamodb_target (Optional) List of nested DynamoDB target arguments. See Dynamodb Target below.
  • jdbc_target (Optional) List of nested JDBC target arguments. See JDBC Target below.
  • s3_target (Optional) List of nested Amazon S3 target arguments. See S3 Target below.
  • mongodb_target (Optional) List of nested MongoDB target arguments. See MongoDB Target below.
  • schedule (Optional) A cron expression used to specify the schedule. For more information, see Time-Based Schedules for Jobs and Crawlers. For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
  • schema_change_policy (Optional) Policy for the crawler's update and deletion behavior. See Schema Change Policy below.
  • lineage_configuration (Optional) Specifies data lineage configuration settings for the crawler. See Lineage Configuration below.
  • recrawl_policy (Optional) A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run. See Recrawl Policy below.
  • security_configuration (Optional) The name of the Security Configuration to be used by the crawler.
  • table_prefix (Optional) The table prefix used for catalog tables that are created.
  • tags - (Optional) Key-value map of resource tags. If configured with a provider default_tags configuration block present, tags with matching keys will overwrite those defined at the provider-level.
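
As a sketch of how several of these optional arguments fit together (the custom classifier reference is hypothetical), a crawler might combine a description, a table prefix, a classifier, and a schedule:

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn
  description   = "Crawls the example bucket into the example database"
  table_prefix  = "raw_"                # created catalog tables are named raw_<table>
  schedule      = "cron(15 12 * * ? *)" # every day at 12:15 UTC

  # Hypothetical custom classifier; see the aws_glue_classifier resource.
  classifiers = [aws_glue_classifier.example.name]

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }
}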

Dynamodb Target

  • path - (Required) The name of the DynamoDB table to crawl.
  • scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high-throughput table. Defaults to true.
  • scan_rate - (Optional) The percentage of the configured read capacity units to use by the AWS Glue crawler. The valid values are null or a value between 0.1 and 1.5.
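
A minimal sketch, reusing the example resources above, that samples rows instead of scanning the full table:

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  dynamodb_target {
    path      = "table-name"
    scan_all  = false # sample rows rather than reading every record
    scan_rate = 0.5   # use 50% of the table's configured read capacity units
  }
}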

JDBC Target

  • connection_name - (Required) The name of the connection to use to connect to the JDBC target.
  • path - (Required) The path of the JDBC target.
  • exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
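
A sketch extending the JDBC example above with an exclusion list (the glob pattern is hypothetical):

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  jdbc_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/%"
    exclusions      = ["database-name/tmp_*"] # hypothetical glob skipping temporary tables
  }
}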

S3 Target

  • path - (Required) The path to the Amazon S3 target.
  • connection_name - (Optional) The name of a connection which allows the crawler to access data in S3 within a VPC.
  • exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
  • sample_size - (Optional) Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.
  • event_queue_arn - (Optional) The ARN of the SQS queue to receive S3 notifications from.
  • dlq_event_queue_arn - (Optional) The ARN of the dead-letter SQS queue.
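
A sketch of an S3 target that skips matching objects and samples each leaf folder (the exclusion globs are hypothetical):

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path        = "s3://${aws_s3_bucket.example.bucket}"
    exclusions  = ["**.tmp", "logs/**"] # hypothetical globs for objects to skip
    sample_size = 10                    # crawl at most 10 files per leaf folder
  }
}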

Catalog Target

  • database_name - (Required) The name of the Glue database to be synchronized.
  • tables - (Required) A list of catalog tables to be synchronized.

Note: The delete_behavior of a catalog target doesn't support DEPRECATE_IN_DATABASE.

Note: configuration for catalog target crawlers will have { ... "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" } } by default.

MongoDB Target

  • connection_name - (Required) The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.
  • path - (Required) The path of the Amazon DocumentDB or MongoDB target (database/collection).
  • scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Default value is true.

Schema Change Policy

  • delete_behavior - (Optional) The deletion behavior when the crawler finds a deleted object. Valid values: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE. Defaults to DEPRECATE_IN_DATABASE.
  • update_behavior - (Optional) The update behavior when the crawler finds a changed schema. Valid values: LOG or UPDATE_IN_DATABASE. Defaults to UPDATE_IN_DATABASE.
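
A sketch that propagates schema updates to the catalog but only logs deletions:

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE" # apply detected schema changes
    delete_behavior = "LOG"                # keep catalog entries; only log deletions
  }
}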

Lineage Configuration

  • crawler_lineage_settings - (Optional) Specifies whether data lineage is enabled for the crawler. Valid values are: ENABLE and DISABLE. Default value is DISABLE.
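
A sketch enabling lineage collection:

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  lineage_configuration {
    crawler_lineage_settings = "ENABLE" # default is DISABLE
  }
}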

Recrawl Policy

  • recrawl_behavior - (Optional) Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run. Valid Values are: CRAWL_EVERYTHING and CRAWL_NEW_FOLDERS_ONLY. Default value is CRAWL_EVERYTHING.
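
A sketch limiting recrawls to newly added folders. AWS expects LOG behaviors in schema_change_policy when incremental crawling is enabled, so they are set here as well (an assumption to verify against your provider version):

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  schema_change_policy {
    update_behavior = "LOG"
    delete_behavior = "LOG"
  }

  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY" # default is CRAWL_EVERYTHING
  }
}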

Attributes Reference

  • id - Crawler name
  • arn - The ARN of the crawler
  • tags_all - A map of tags assigned to the resource, including those inherited from the provider default_tags configuration block.

Import

Replace the ${crawler_job} parameter in the following command and run it to import a Glue Crawler:

$ terraform import aws_glue_crawler.${crawler_job} ${crawler_job}
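
For example, to import the crawler named example defined in the samples above:

$ terraform import aws_glue_crawler.example example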

