Type
Resource
Tag
aws_glue_crawler
AWS Service
Glue
Description
Used to create a Glue crawler.
Examples
Data stored in DynamoDB
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn dynamodb_target { path = "table-name" } }
Data stored in a database, with metadata crawled via JDBC
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn jdbc_target { connection_name = aws_glue_connection.example.name path = "database-name/%" } }
Data stored in S3
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn s3_target { path = "s3://${aws_s3_bucket.example.bucket}" } }
Data stored in the Glue Data Catalog
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn catalog_target { database_name = aws_glue_catalog_database.example.name tables = [aws_glue_catalog_table.example.name] } schema_change_policy { delete_behavior = "LOG" } configuration = <<EOF { "Version":1.0, "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" } } EOF }
Data stored in MongoDB
resource "aws_glue_crawler" "example" { database_name = aws_glue_catalog_database.example.name name = "example" role = aws_iam_role.example.arn mongodb_target { connection_name = aws_glue_connection.example.name path = "database-name/%" } }
Example of crawler configuration settings
resource "aws_glue_crawler" "events_crawler" { database_name = aws_glue_catalog_database.glue_database.name schedule = "cron(0 1 * * ? *)" name = "events_crawler_${var.environment_name}" role = aws_iam_role.glue_role.arn tags = var.tags configuration = jsonencode( { Grouping = { TableGroupingPolicy = "CombineCompatibleSchemas" } CrawlerOutput = { Partitions = { AddOrUpdateBehavior = "InheritFromTable" } } Version = 1 } ) s3_target { path = "s3://${aws_s3_bucket.data_lake_bucket.bucket}" } }
Arguments
Note: Must specify at least one of dynamodb_target, jdbc_target, s3_target, mongodb_target, or catalog_target.
- database_name (Required) Glue database where results are written.
- name (Required) Name of the crawler.
- role (Required) The IAM role friendly name (including path without leading slash), or ARN of an IAM role, used by the crawler to access other resources.
- classifiers (Optional) List of custom classifiers. By default, all AWS classifiers are included in a crawl, but these custom classifiers always override the default classifiers for a given classification (see the sketch after this list).
- configuration (Optional) JSON string of configuration information. For more details see Setting Crawler Configuration Options.
- description (Optional) Description of the crawler.
- dynamodb_target (Optional) List of nested DynamoDB target arguments. See Dynamodb Target below.
- jdbc_target (Optional) List of nested JDBC target arguments. See JDBC Target below.
- s3_target (Optional) List of nested Amazon S3 target arguments. See S3 Target below.
- mongodb_target (Optional) List of nested MongoDB target arguments. See MongoDB Target below.
- schedule (Optional) A cron expression used to specify the schedule. For more information, see Time-Based Schedules for Jobs and Crawlers. For example, to run something every day at 12:15 UTC, you would specify: cron(15 12 * * ? *).
- schema_change_policy (Optional) Policy for the crawler's update and deletion behavior. See Schema Change Policy below.
- lineage_configuration (Optional) Specifies data lineage configuration settings for the crawler. See Lineage Configuration below.
- recrawl_policy (Optional) A policy that specifies whether to crawl the entire dataset again, or to crawl only folders that were added since the last crawler run. See Recrawl Policy below.
- security_configuration (Optional) The name of the Security Configuration to be used by the crawler.
- table_prefix (Optional) The table prefix used for catalog tables that are created.
- tags - (Optional) Key-value map of resource tags. If configured with a provider default_tags configuration block present, tags with matching keys will overwrite those defined at the provider-level.
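Where custom classifiers or a table prefix come into play, the two arguments compose as in the following minimal sketch. It is not taken from the examples above: the CSV classifier, its settings, and all resource names are assumptions for illustration.
resource "aws_glue_classifier" "example_csv" {
  name = "example-csv" # hypothetical classifier name

  csv_classifier {
    contains_header = "PRESENT"
    delimiter       = ","
    quote_symbol    = "\""
  }
}

resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  # Custom classifiers override the built-in ones for their classification.
  classifiers  = [aws_glue_classifier.example_csv.name]
  table_prefix = "raw_" # catalog tables are created as raw_<table-name>

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }
}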
Dynamodb Target
- path - (Required) The name of the DynamoDB table to crawl.
- scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Defaults to true.
- scan_rate - (Optional) The percentage of the configured read capacity units to use by the AWS Glue crawler. The valid values are null or a value between 0.1 and 1.5.
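A minimal sketch of a throttled, sampling DynamoDB crawl; the table name and the chosen values are assumptions:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  dynamodb_target {
    path      = "table-name"
    scan_all  = false # sample rows instead of reading the full table
    scan_rate = 0.5   # use at most 50% of the configured read capacity units
  }
}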
JDBC Target
- connection_name - (Required) The name of the connection to use to connect to the JDBC target.
- path - (Required) The path of the JDBC target.
- exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
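As a sketch, exclusions on a JDBC target might look like the following; the exclusion pattern is an assumption, not a value from this page:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  jdbc_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/%"
    exclusions      = ["database-name/temp_%"] # hypothetical pattern: skip temp tables
  }
}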
S3 Target
- path - (Required) The path to the Amazon S3 target.
- connection_name - (Optional) The name of a connection which allows the crawler to access data in S3 within a VPC.
- exclusions - (Optional) A list of glob patterns used to exclude from the crawl.
- sample_size - (Optional) Sets the number of files in each leaf folder to be crawled when crawling sample files in a dataset. If not set, all the files are crawled. A valid value is an integer between 1 and 249.
- event_queue_arn - (Optional) The ARN of the SQS queue to receive S3 notifications from.
- dlq_event_queue_arn - (Optional) The ARN of the dead-letter SQS queue.
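A hedged sketch combining these S3 target options; the glob patterns and sample size are assumptions chosen for illustration:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path        = "s3://${aws_s3_bucket.example.bucket}"
    exclusions  = ["**.gz", "tmp/**"] # hypothetical globs: skip gzip files and the tmp/ prefix
    sample_size = 10                  # crawl at most 10 files per leaf folder
  }
}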
Catalog Target
- database_name - (Required) The name of the Glue database to be synchronized.
- tables - (Required) A list of catalog tables to be synchronized.
Note: The delete_behavior of a catalog target doesn't support DEPRECATE_IN_DATABASE.
Note: The configuration for catalog target crawlers will have { ... "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas"} } by default.
MongoDB Target
- connection_name - (Required) The name of the connection to use to connect to the Amazon DocumentDB or MongoDB target.
- path - (Required) The path of the Amazon DocumentDB or MongoDB target (database/collection).
- scan_all - (Optional) Indicates whether to scan all the records, or to sample rows from the table. Scanning all the records can take a long time when the table is not a high throughput table. Default value is true.
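A sketch of sampling a DocumentDB/MongoDB collection rather than scanning it fully; the database/collection path is an assumption:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  mongodb_target {
    connection_name = aws_glue_connection.example.name
    path            = "database-name/collection-name" # hypothetical database/collection
    scan_all        = false                           # sample rows rather than scanning everything
  }
}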
Schema Change Policy
- delete_behavior - (Optional) The deletion behavior when the crawler finds a deleted object. Valid values: LOG, DELETE_FROM_DATABASE, or DEPRECATE_IN_DATABASE. Defaults to DEPRECATE_IN_DATABASE.
- update_behavior - (Optional) The update behavior when the crawler finds a changed schema. Valid values: LOG or UPDATE_IN_DATABASE. Defaults to UPDATE_IN_DATABASE.
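A minimal sketch of an explicit schema change policy; the values are chosen for illustration:
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  schema_change_policy {
    delete_behavior = "LOG"                # only log objects the crawler finds deleted
    update_behavior = "UPDATE_IN_DATABASE" # apply schema changes to the catalog
  }
}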
Lineage Configuration
- crawler_lineage_settings - (Optional) Specifies whether data lineage is enabled for the crawler. Valid values are: ENABLE and DISABLE. Default value is DISABLE.
Recrawl Policy
- recrawl_behavior - (Optional) Specifies whether to crawl the entire dataset again or to crawl only folders that were added since the last crawler run. Valid Values are: CRAWL_EVERYTHING and CRAWL_NEW_FOLDERS_ONLY. Default value is CRAWL_EVERYTHING.
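A sketch enabling lineage and incremental recrawl together. Note the schema_change_policy below: as an AWS-side constraint not stated above, CRAWL_NEW_FOLDERS_ONLY is expected to require LOG for both behaviors.
resource "aws_glue_crawler" "example" {
  database_name = aws_glue_catalog_database.example.name
  name          = "example"
  role          = aws_iam_role.example.arn

  s3_target {
    path = "s3://${aws_s3_bucket.example.bucket}"
  }

  lineage_configuration {
    crawler_lineage_settings = "ENABLE" # default is DISABLE
  }

  recrawl_policy {
    recrawl_behavior = "CRAWL_NEW_FOLDERS_ONLY" # only folders added since the last run
  }

  schema_change_policy {
    # CRAWL_NEW_FOLDERS_ONLY requires LOG behaviors (AWS constraint, assumed here)
    delete_behavior = "LOG"
    update_behavior = "LOG"
  }
}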
Attribute Reference
- id - Crawler name.
- arn - The ARN of the crawler.
- tags_all - A map of tags assigned to the resource, including those inherited from the provider default_tags configuration block.
Import
Replace the ${crawler_job} parameter in the command below and run it to import the Glue Crawler:
$ terraform import aws_glue_crawler.${crawler_job} ${crawler_job}
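For example, importing a crawler named events_crawler (a hypothetical name) would look like:
$ terraform import aws_glue_crawler.events_crawler events_crawler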