Includes all processors through release

With each new release of NiFi, the number of processors has grown from the original 53, to 154, to what we currently have today! Here is an alphabetical list of all processors in Apache NiFi as of the most recent release. Each one links to a description of the processor further down. The Usage documentation available in the web UI has much more detail about each processor, its properties, modifiable attributes, and relationships, and each processor has its own page in the UI, so this is just a quick overview. Again, this content is taken directly from NiFi's Usage guide in the web UI, and all credit/rights belong to the Apache NiFi project under the Apache 2.0 License.

New Processors

In 1.14.0, there are 5 new processors compared to the previous version, 1.13.2.

NiFi has improved its documentation, which was originally only available while running Apache NiFi. The documentation is now produced through the build process and has been added to the project website, so if you need more information or detail about any processor, just check there.

AttributeRollingWindow

Track a Rolling Window based on evaluating an Expression Language expression on each FlowFile and add that value to the processor’s state. Each FlowFile will be emitted with the count of FlowFiles and total aggregate value of values processed in the current time window.

AttributesToCSV

Generates a CSV representation of the input FlowFile Attributes. The resulting CSV can be written to either a newly generated attribute named ‘CSVAttributes’ or written to the FlowFile as content. If the attribute value contains a comma, newline or double quote, then the attribute value will be escaped with double quotes. Any double quote characters in the attribute value are escaped with another double quote.
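To make the escaping rule concrete, here is a minimal Python sketch that reproduces the same CSV quoting behavior with the standard csv module; the attribute names and values are made up for illustration and this is not the processor's own code.

```python
import csv
import io

# Hypothetical FlowFile attributes; the "note" value contains commas, a double
# quote, and a newline, so it must be quoted and its embedded quote doubled.
attributes = {
    "filename": "report.csv",
    "note": 'totals, "approved", pending\nsecond line',
}

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)  # quote only when needed
writer.writerow(attributes.keys())    # header row of attribute names
writer.writerow(attributes.values())  # escaped attribute values

print(buffer.getvalue())
# filename,note
# report.csv,"totals, ""approved"", pending
# second line"
```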

AttributesToJSON

Generates a JSON representation of the input FlowFile Attributes. The resulting JSON can be written to either a new Attribute ‘JSONAttributes’ or written to the FlowFile as content.

Base64EncodeContent

Encodes or decodes content to and from base64

CalculateRecordStats

A processor that can count the number of items in a record set, as well as provide counts based on user-defined criteria on subsets of the record set.

CaptureChangeMySQL

Retrieves Change Data Capture (CDC) events from a MySQL database. CDC Events include INSERT, UPDATE, DELETE operations. Events are output as individual flow files ordered by the time at which the operation occurred.

CompareFuzzyHash

Compares an attribute containing a Fuzzy Hash against a file containing a list of fuzzy hashes, appending an attribute to the FlowFile in case of a successful match.

CompressContent

Compresses or decompresses the contents of FlowFiles using a user-specified compression algorithm and updates the mime.type attribute as appropriate. This processor operates in a very memory-efficient way, so very large objects well beyond the heap size are generally fine to process.

ConnectWebSocket

Acts as a WebSocket client endpoint to interact with a remote WebSocket server. FlowFiles are transferred to downstream relationships according to the received message types as the WebSocket client configured with this processor receives messages from the remote WebSocket server.

ConsumeAMQP

Consumes AMQP Messages from an AMQP Broker using the AMQP 0.9.1 protocol. Each message that is received from the AMQP Broker will be emitted as its own FlowFile to the ‘success’ relationship.

ConsumeAzureEventHub

Receives messages from Azure Event Hubs, writing the contents of the message to the content of the FlowFile.

ConsumeEWS

Consumes messages from Microsoft Exchange using Exchange Web Services. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeGCPubSub

Consumes message from the configured Google Cloud PubSub subscription. If the ‘Batch Size’ is set, the configured number of messages will be pulled in a single request, else only one message will be pulled.

ConsumeIMAP

Consumes messages from Email Server using IMAP protocol. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeJMS

Consumes JMS Message of type BytesMessage, TextMessage, ObjectMessage, MapMessage or StreamMessage transforming its content to a FlowFile and transitioning it to ‘success’ relationship. JMS attributes such as headers and properties will be copied as FlowFile attributes. MapMessages will be transformed into JSONs and then into byte arrays. The other types will have their raw contents as byte array transferred into the flowfile.

ConsumeKafka_1_0

Consumes messages from Apache Kafka specifically built against the Kafka 1.0 Consumer API. The complementary NiFi processor for sending messages is PublishKafka_1_0.

ConsumeKafka_2_0

Consumes messages from Apache Kafka specifically built against the Kafka 2.0 Consumer API. The complementary NiFi processor for sending messages is PublishKafka_2_0.

ConsumeKafka_2_6

Consumes messages from Apache Kafka specifically built against the Kafka 2.6 Consumer API. The complementary NiFi processor for sending messages is PublishKafka_2_6.

ConsumeKafkaRecord_1_0

Consumes messages from Apache Kafka specifically built against the Kafka 1.0 Consumer API. The complementary NiFi processor for sending messages is PublishKafkaRecord_1_0. Please note that, at this time, the Processor assumes that all records that are retrieved from a given partition have the same schema. If any of the Kafka messages are pulled but cannot be parsed or written with the configured Record Reader or Record Writer, the contents of the message will be written to a separate FlowFile, and that FlowFile will be transferred to the ‘parse.failure’ relationship. Otherwise, each FlowFile is sent to the ‘success’ relationship and may contain many individual messages within the single FlowFile. A ‘record.count’ attribute is added to indicate how many messages are contained in the FlowFile. No two Kafka messages will be placed into the same FlowFile if they have different schemas, or if they have different values for a message header that is included by the property.

ConsumeKafkaRecord_2_0

Consumes messages from Apache Kafka specifically built against the Kafka 2.0 Consumer API. The complementary NiFi processor for sending messages is PublishKafkaRecord_2_0. Please note that, at this time, the Processor assumes that all records that are retrieved from a given partition have the same schema. If any of the Kafka messages are pulled but cannot be parsed or written with the configured Record Reader or Record Writer, the contents of the message will be written to a separate FlowFile, and that FlowFile will be transferred to the ‘parse.failure’ relationship. Otherwise, each FlowFile is sent to the ‘success’ relationship and may contain many individual messages within the single FlowFile. A ‘record.count’ attribute is added to indicate how many messages are contained in the FlowFile. No two Kafka messages will be placed into the same FlowFile if they have different schemas, or if they have different values for a message header that is included by the property.

ConsumeKafkaRecord_2_6

Consumes messages from Apache Kafka specifically built against the Kafka 2.6 Consumer API. The complementary NiFi processor for sending messages is PublishKafkaRecord_2_6. Please note that, at this time, the Processor assumes that all records that are retrieved from a given partition have the same schema. If any of the Kafka messages are pulled but cannot be parsed or written with the configured Record Reader or Record Writer, the contents of the message will be written to a separate FlowFile, and that FlowFile will be transferred to the ‘parse.failure’ relationship. Otherwise, each FlowFile is sent to the ‘success’ relationship and may contain many individual messages within the single FlowFile. A ‘record.count’ attribute is added to indicate how many messages are contained in the FlowFile. No two Kafka messages will be placed into the same FlowFile if they have different schemas, or if they have different values for a message header that is included by the property.

ConsumeKinesisStream

Reads data from the specified AWS Kinesis stream and outputs a FlowFile for every processed Record (raw) or a FlowFile for a batch of processed records if a Record Reader and Record Writer are configured. At-least-once delivery of all Kinesis Records within the Stream while the processor is running. AWS Kinesis Client Library can take several seconds to initialise before starting to fetch data. Uses DynamoDB for check pointing and CloudWatch (optional) for metrics. Ensure that the credentials provided have access to DynamoDB and CloudWatch (optional) along with Kinesis.

ConsumeMQTT

Subscribes to a topic and receives messages from an MQTT broker

ConsumePOP3

Consumes messages from Email Server using POP3 protocol. The raw-bytes of each received email message are written as contents of the FlowFile

ConsumeWindowsEventLog

Registers a Windows Event Log Subscribe Callback to receive FlowFiles from Events on Windows. These can be filtered via channel and XPath.

ControlRate

Controls the rate at which data is transferred to follow-on processors. If you configure a very small Time Duration, then the accuracy of the throttle gets worse. You can improve this accuracy by decreasing the Yield Duration, at the expense of more Tasks given to the processor.

ConvertAvroToJSON

Converts a Binary Avro record into a JSON object. This processor provides a direct mapping of an Avro field to a JSON field, such that the resulting JSON will have the same hierarchical structure as the Avro document. Note that the Avro schema information will be lost, as this is not a translation from binary Avro to JSON formatted Avro. The output JSON is encoded with the UTF-8 encoding. If an incoming FlowFile contains a stream of multiple Avro records, the resultant FlowFile will contain a JSON Array containing all of the Avro records or a sequence of JSON Objects. If an incoming FlowFile does not contain any records, an empty JSON object is the output. Empty/Single Avro record FlowFile inputs are optionally wrapped in a container as dictated by 'Wrap Single Record'

ConvertAvroToORC

Converts an Avro record into ORC file format. This processor provides a direct mapping of an Avro record to an ORC record, such that the resulting ORC file will have the same hierarchical structure as the Avro document. If an incoming FlowFile contains a stream of multiple Avro records, the resultant FlowFile will contain an ORC file containing all of the Avro records. If an incoming FlowFile does not contain any records, an empty ORC file is the output. NOTE: Many Avro datatypes (collections, primitives, and unions of primitives, e.g.) can be converted to ORC, but unions of collections and other complex datatypes may not be able to be converted to ORC.

ConvertAvroToParquet

Converts Avro records into Parquet file format. The incoming FlowFile should be a valid avro file. If an incoming FlowFile does not contain any records, an empty parquet file is the output. NOTE: Many Avro datatypes (collections, primitives, and unions of primitives, e.g.) can be converted to parquet, but unions of collections and other complex datatypes may not be able to be converted to Parquet.

ConvertCharacterSet

Converts a FlowFile’s content from one character set to another

ConvertExcelToCSVProcessor

Consumes a Microsoft Excel document and converts each worksheet to CSV. Each sheet from the incoming Excel document will generate a new FlowFile that will be output from this processor. Each output FlowFile's contents will be formatted as a CSV file where each row from the Excel sheet is output as a newline in the CSV file. This processor is currently only capable of processing .xlsx (XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents. This processor also expects well-formatted CSV content and will not escape cells containing invalid content such as newlines or additional commas.

ConvertJSONToSQL

Converts a JSON-formatted FlowFile into an UPDATE, INSERT, or DELETE SQL statement. The incoming FlowFile is expected to be a "flat" JSON message, meaning that it consists of a single JSON element and each field maps to a simple type. If a field maps to a JSON object, that JSON object will be interpreted as Text. If the input is an array of JSON elements, each element in the array is output as a separate FlowFile to the 'sql' relationship. Upon successful conversion, the original FlowFile is routed to the 'original' relationship and the SQL is routed to the 'sql' relationship.
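As a rough sketch of that mapping, the snippet below turns a flat JSON element into a parameterized INSERT statement. The table name and attribute naming here are assumptions for illustration; the real processor derives columns from the target table's metadata and records the bound values in sql.args.N.* FlowFile attributes.

```python
import json

# A "flat" JSON message: a single element whose fields map to simple types.
flowfile_content = '{"id": 1, "name": "Alice", "balance": 42.5}'
table_name = "users"  # illustrative; supplied as a processor property in practice

record = json.loads(flowfile_content)
columns = list(record.keys())

sql = (
    f"INSERT INTO {table_name} ({', '.join(columns)}) "
    f"VALUES ({', '.join('?' for _ in columns)})"
)

# Sketch of the companion attributes carrying the parameter values.
attributes = {
    f"sql.args.{i}.value": str(record[column])
    for i, column in enumerate(columns, start=1)
}

print(sql)         # INSERT INTO users (id, name, balance) VALUES (?, ?, ?)
print(attributes)  # {'sql.args.1.value': '1', 'sql.args.2.value': 'Alice', ...}
```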

ConvertRecord

Converts records from one data format to another using configured Record Reader and Record Writer Controller Services. The Reader and Writer must be configured with "matching" schemas. By this, we mean the schemas must have the same field names. The types of the fields do not have to be the same if a field value can be coerced from one type to another. For instance, if the input schema has a field named "balance" of type double, the output schema can have a field named "balance" with a type of string, double, or float. If any field is present in the input that is not present in the output, the field will be left out of the output. If any field is specified in the output schema but is not present in the input data/schema, then the field will not be present in the output or will have a null value, depending on the writer.

CountText

Counts various metrics on incoming text. The requested results will be recorded as attributes. The resulting flowfile will not have its content modified.

CreateHadoopSequenceFile

Creates Hadoop Sequence Files from incoming flow files

CryptographicHashAttribute

Calculates a hash value for each of the specified attributes using the given algorithm and writes it to an output attribute. Please refer to https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions for help to decide which algorithm to use.

CryptographicHashContent

Calculates a cryptographic hash value for the flowfile content using the given algorithm and writes it to an output attribute. Please refer to https://csrc.nist.gov/Projects/Hash-Functions/NIST-Policy-on-Hash-Functions for help to decide which algorithm to use.

DebugFlow

The DebugFlow processor aids testing and debugging the FlowFile framework by allowing various responses to be explicitly triggered in response to the receipt of a FlowFile or a timer event without a FlowFile if using timer or cron based scheduling. It can force responses needed to exercise or test various failure modes that can occur when a processor runs.

DecryptContentPGP

Decrypt Contents of OpenPGP Messages

DeleteAzureBlobStorage

Deletes the provided blob from Azure Storage

DeleteAzureDataLakeStorage

Deletes the provided file from Azure Data Lake Storage

DeleteByQueryElasticsearch

Delete from an Elasticsearch index using a query. The query can be loaded from a flowfile body or from the Query parameter.

DeleteDynamoDB

Deletes a document from DynamoDB based on hash and range key. The key can be string or number. The request requires all the primary keys for the operation (hash or hash and range key)

DeleteGCSObject

Deletes objects from a Google Cloud Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

DeleteGridFS

Deletes a file from GridFS using a file name or a query.

DeleteHBaseCells

This processor allows the user to delete individual HBase cells by specifying one or more lines in the flowfile content that are a sequence composed of row ID, column family, column qualifier and associated visibility labels if visibility labels are enabled and in use. A user-defined separator is used to separate each of these pieces of data on each line, with :::: being the default separator.
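The sketch below parses content in that line format using the default '::::' separator; the row IDs, families, qualifiers, and label are invented for illustration, and the parsing is simplified compared to the processor itself.

```python
# Two hypothetical lines: row ID, column family, column qualifier, and an
# optional visibility label, joined by the default "::::" separator.
flowfile_content = "row-1::::cf::::name\nrow-2::::cf::::email::::PII"
SEPARATOR = "::::"

for line in flowfile_content.splitlines():
    parts = line.split(SEPARATOR)
    row_id, family, qualifier = parts[0], parts[1], parts[2]
    visibility = parts[3] if len(parts) > 3 else None
    print(f"delete {row_id} {family}:{qualifier} (visibility={visibility})")
```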

DeleteHBaseRow

Delete HBase records individually or in batches. The input can be a single row ID in the flowfile content, one ID per line, row IDs separated by a configurable separator character (default is a comma).

DeleteHDFS

Deletes one or more files or directories from HDFS. The path can be provided as an attribute from an incoming FlowFile, or a statically set path that is periodically removed. If this processor has an incoming connection, it will ignore running on a periodic basis and instead rely on incoming FlowFiles to trigger a delete. Note that you may use a wildcard character to match multiple files or directories. If there are no incoming connections, no FlowFiles will be transferred to any output relationships. If there is an incoming FlowFile, then provided there are no detected failures it will be transferred to success; otherwise it will be sent to failure. If knowledge of globbed files deleted is necessary, use ListHDFS first to produce a specific list of files to delete.

DeleteMongo

Executes a delete query against a MongoDB collection. The query is provided in the body of the flowfile and the user can select whether it will delete one or many documents that match it.

DeleteRethinkDB

Processor to remove a JSON document from RethinkDB (https://www.rethinkdb.com/) using the document id.

DeleteS3Object

Deletes FlowFiles on an Amazon S3 Bucket. If attempting to delete a file that does not exist, FlowFile is routed to success.

DeleteSQS

Deletes a message from an Amazon Simple Queuing Service Queue

DetectDuplicate

Caches a value, computed from FlowFile attributes, for each incoming FlowFile and determines if the cached value has already been seen. If so, routes the FlowFile to ‘duplicate’ with an attribute named ‘original.identifier’ that specifies the original FlowFile’s “description”, which is specified in the property. If the FlowFile is not determined to be a duplicate, the Processor routes the FlowFile to 'non-duplicate'

DistributeLoad

Distributes FlowFiles to downstream processors based on a Distribution Strategy. If using the Round Robin strategy, the default is to assign each destination a weighting of 1 (evenly distributed). However, optional properties can be added to change this; adding a property with the name '5' and value '10' means that the relationship with name '5' will receive 10 FlowFiles in each iteration instead of 1.
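A small Python sketch of that weighted Round Robin behavior, assuming five relationships where relationship '5' has been given a weight of 10 and the rest keep the default weight of 1; the weights and FlowFile count are purely illustrative.

```python
from itertools import cycle

# Relationship name -> weight; '5' receives 10 FlowFiles per iteration, the rest 1.
weights = {"1": 1, "2": 1, "3": 1, "4": 1, "5": 10}

# One full iteration of the rotation: each relationship repeated by its weight.
rotation = [name for name, weight in weights.items() for _ in range(weight)]
destinations = cycle(rotation)

for flowfile_number in range(1, 15):
    print(f"FlowFile {flowfile_number} -> relationship {next(destinations)}")
```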

DuplicateFlowFile

Intended for load testing, this processor will create the configured number of copies of each incoming FlowFile. The original FlowFile as well as all generated copies are sent to the ‘success’ relationship. In addition, each FlowFile gets an attribute ‘copy.index’ set to the copy number, where the original FlowFile gets a value of zero, and all copies receive incremented integer values.

EncryptContent

Encrypts or Decrypts a FlowFile using either symmetric encryption with a raw key or password and randomly generated salt, or asymmetric encryption using a public and secret key.

EncryptContentPGP

Encrypt Contents using OpenPGP

EnforceOrder

Enforces expected ordering of FlowFiles that belong to the same data group within a single node. Although PriorityAttributePrioritizer can be used on a connection to ensure that flow files going through that connection are in priority order, depending on error-handling, branching, and other flow designs, it is possible for FlowFiles to get out-of-order. EnforceOrder can be used to enforce original ordering for those FlowFiles. [IMPORTANT] For EnforceOrder to take effect, FirstInFirstOutPrioritizer should be used at EVERY downstream relationship UNTIL the order of FlowFiles is physically fixed by an operation such as MergeContent or by being stored at the final destination.

EvaluateJsonPath

Evaluates one or more JsonPath expressions against the content of a FlowFile. The results of those expressions are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. JsonPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid JsonPath expression. A Return Type of 'auto-detect' will make a determination based off the configured destination. When 'Destination' is set to 'flowfile-attribute', a return type of 'scalar' will be used. When 'Destination' is set to 'flowfile-content', a return type of 'JSON' will be used. If the JsonPath evaluates to a JSON array or JSON object and the Return Type is set to 'scalar', the FlowFile will be unmodified and will be routed to failure. A Return Type of JSON can return scalar values if the provided JsonPath evaluates to the specified value and will be routed as a match. If Destination is 'flowfile-content' and the JsonPath does not evaluate to a defined path, the FlowFile will be routed to 'unmatched' without having its contents modified. If Destination is flowfile-attribute and the expression matches nothing, attributes will be created with empty strings as the value, and the FlowFile will always be routed to 'matched'.
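The snippet below illustrates the scalar-versus-structure distinction using the third-party jsonpath-ng package as a stand-in; NiFi itself evaluates JsonPath with a Java engine, and the document, paths, and property names here are assumptions for demonstration.

```python
# pip install jsonpath-ng   (used here only to illustrate JsonPath results)
import json
from jsonpath_ng import parse

flowfile_content = '{"order": {"id": 42, "items": [{"sku": "A-1"}, {"sku": "B-2"}]}}'
data = json.loads(flowfile_content)

# A user-defined property whose value is "$.order.id" yields a scalar,
# suitable for a flowfile-attribute Destination.
order_id = [match.value for match in parse("$.order.id").find(data)]
print(order_id)   # [42]

# "$.order.items" evaluates to a JSON array; with Return Type 'scalar' the
# FlowFile would be routed to failure, with 'JSON' the structure is returned.
items = [match.value for match in parse("$.order.items").find(data)]
print(items)      # [[{'sku': 'A-1'}, {'sku': 'B-2'}]]
```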

EvaluateXPath

Evaluates one or more XPaths against the content of a FlowFile. The results of those XPaths are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XPaths are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is flowfile-attribute; otherwise, the property name is ignored). The value of the property must be a valid XPath expression. If the XPath evaluates to more than one node and the Return Type is set to ‘nodeset’ (either directly, or via ‘auto-detect’ with a Destination of ‘flowfile-content’), the FlowFile will be unmodified and will be routed to failure. If the XPath does not evaluate to a Node, the FlowFile will be routed to ‘unmatched’ without having its contents modified. If Destination is flowfile-attribute and the expression matches nothing, attributes will be created with empty strings as the value, and the FlowFile will always be routed to ‘matched’

EvaluateXQuery

Evaluates one or more XQueries against the content of a FlowFile. The results of those XQueries are assigned to FlowFile Attributes or are written to the content of the FlowFile itself, depending on configuration of the Processor. XQueries are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed (if the Destination is 'flowfile-attribute'; otherwise, the property name is ignored). The value of the property must be a valid XQuery. If the XQuery returns more than one result, new attributes or FlowFiles (for Destinations of 'flowfile-attribute' or 'flowfile-content' respectively) will be created for each result (attributes will have a '.n' one-up number appended to the specified attribute name). If any provided XQuery returns a result, the FlowFile(s) will be routed to 'matched'. If no provided XQuery returns a result, the FlowFile will be routed to 'unmatched'. If the Destination is 'flowfile-attribute' and the XQueries match nothing, no attributes will be applied to the FlowFile.

ExecuteGroovyScript

Experimental Extended Groovy script processor. The script is responsible for handling the incoming flow file (transfer to SUCCESS or remove, e.g.) as well as any flow files created by the script. If the handling is incomplete or incorrect, the session will be rolled back.

ExecuteInfluxDBQuery

Processor to execute InfluxDB query from the content of a FlowFile (preferred) or a scheduled query. Please check details of the supported queries in InfluxDB documentation (https://www.influxdb.com/).

ExecuteProcess

Runs an operating system command specified by the user and writes the output of that command to a FlowFile. If the command is expected to be long-running, the Processor can output the partial data on a specified interval. When this option is used, the output is expected to be in textual format, as it typically does not make sense to split binary data on arbitrary time-based intervals.

ExecuteScript

Experimental - Executes a script given the flow file and a process session. The script is responsible for handling the incoming flow file (transfer to SUCCESS or remove, e.g.) as well as any flow files created by the script. If the handling is incomplete or incorrect, the session will be rolled back. Experimental: Impact of sustained usage not yet verified.

ExecuteSQL

Executes provided SQL select query. Query result will be converted to Avro format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. FlowFile attribute ‘executesql.row.count’ indicates how many rows were selected.
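To show the sql.args.N.type / sql.args.N.value convention in action, here is a hedged sketch using Python's built-in sqlite3 as a stand-in for the JDBC connection the processor would actually use; the table, attributes, and query are hypothetical, and JDBC type 12 corresponds to VARCHAR.

```python
import sqlite3

# Hypothetical FlowFile attributes following the sql.args.N.* convention.
attributes = {
    "sql.args.1.type": "12",        # JDBC type 12 = VARCHAR
    "sql.args.1.value": "ACTIVE",
}
query = "SELECT id, name FROM users WHERE status = ?"

# Gather positional parameters in order of N.
params, n = [], 1
while f"sql.args.{n}.value" in attributes:
    params.append(attributes[f"sql.args.{n}.value"])
    n += 1

# sqlite3 stands in for the JDBC DataSource configured on the processor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, status TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'ACTIVE')")

rows = conn.execute(query, params).fetchall()
print(rows)                                   # [(1, 'Alice')]
print({"executesql.row.count": len(rows)})    # row count attribute analogue
```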

ExecuteSQLRecord

Executes provided SQL select query. Query result will be converted to the format specified by a Record Writer. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query, and the query may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format. FlowFile attribute ‘executesql.row.count’ indicates how many rows were selected.

ExecuteStreamCommand

Executes an external command on the contents of a flow file, and creates a new flow file with the results of the command.

ExtractAvroMetadata

Extracts metadata from the header of an Avro datafile.

ExtractCCDAAttributes

Extracts information from a Consolidated CDA formatted FlowFile and provides individual attributes as FlowFile attributes. Attribute names follow the element hierarchy, with parent and child names joined by dots; if a parent element is repeating, an index is appended to its name. For example, section.act_07.observation.name=Essential hypertension

ExtractEmailAttachments

Extract attachments from a mime formatted email file, splitting them into individual flowfiles.

ExtractEmailHeaders

Using the flowfile content as a source of data, extract headers from an RFC-compliant email file, adding the relevant attributes to the flowfile. This processor does not perform extensive RFC validation but still requires a bare minimum of compliance with RFC 2822

ExtractGrok

Evaluates one or more Grok Expressions against the content of a FlowFile, adding the results as attributes or replacing the content of the FlowFile with a JSON notation of the matched content

ExtractHL7Attributes

Extracts information from an HL7 (Health Level 7) formatted FlowFile and adds the information as FlowFile Attributes. Attributes are named using the segment name and field index (SegmentName.FieldIndex); if the segment is repeating, the segment index is appended to the segment name with an underscore (SegmentName_SegmentIndex.FieldIndex). For example, we may have an attribute named "MHS.12" with a value of "2.1" and an attribute named "OBX_11.3" with a value of "93000^CPT4".

ExtractText

Evaluates one or more Regular Expressions against the content of a FlowFile. The results of those Regular Expressions are assigned to FlowFile Attributes. Regular Expressions are entered by adding user-defined properties; the name of the property maps to the Attribute Name into which the result will be placed. The attributes are generated differently based on whether named capture groups are enabled. If named capture groups are not enabled: the first capture group, if any is found, will be placed into that attribute name, but all capture groups, including the matching string sequence itself, will also be provided at that attribute name with an index value, with the exception of a capturing group that is optional and does not match. For example, given the attribute name "regex" and expression "abc(def)?(g)", we would add an attribute "regex.1" with a value of "def" if the "def" matched; if the "def" did not match, no attribute named "regex.1" would be added, but an attribute named "regex.2" with a value of "g" would be added regardless. If named capture groups are enabled: each named capture group, if found, will be placed into an attribute using the provided group name. If enabled, the matching string sequence itself will also be placed into the attribute name. If multiple matches are enabled, an index will be applied after the first set of matches. The exception is a capturing group that is optional and does not match. For example, given the attribute name "regex" and expression "abc(?<NAMED>def)?(?<NAMED-TWO>g)", we would add an attribute "regex.NAMED" with the value of "def" if the "def" matched, and we would add an attribute "regex.NAMED-TWO" with the value of "g" if the "g" matched, regardless. The value of the property must be a valid Regular Expression with one or more capturing groups. If named capture groups are enabled, all capture groups must be named; if they are not, the processor configuration will fail validation. If the Regular Expression matches more than once, only the first match will be used unless the property enabling repeating capture groups is set to true. If any provided Regular Expression matches, the FlowFile(s) will be routed to 'matched'. If no provided Regular Expression matches, the FlowFile will be routed to 'unmatched' and no attributes will be applied to the FlowFile.
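The sketch below mirrors the unnamed-capture-group example above with Python's re module, showing how indexed attributes could be derived from a match; whether group 0 is included and how repeated matches are indexed depend on processor properties, so treat this purely as an illustration.

```python
import re

attribute_name = "regex"
pattern = re.compile(r"abc(def)?(g)")   # the example expression from above
flowfile_content = "xxabcdefgxx"

attributes = {}
match = pattern.search(flowfile_content)
if match:
    # Index 0 is the whole matching sequence; 1..N are the capture groups.
    # An optional group that did not participate yields None and no attribute.
    for index in range(pattern.groups + 1):
        value = match.group(index)
        if value is not None:
            attributes[f"{attribute_name}.{index}"] = value

print(attributes)
# {'regex.0': 'abcdefg', 'regex.1': 'def', 'regex.2': 'g'}
```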

ExtractTNEFAttachments

Extract attachments from a mime formatted email file, splitting them into individual flowfiles.

FetchAzureBlobStorage

Retrieves contents of an Azure Storage Blob, writing the contents to the content of the FlowFile

FetchAzureDataLakeStorage

Fetch the provided file from Azure Data Lake Storage

FetchDistributedMapCache

Computes cache key(s) from FlowFile attributes, for each incoming FlowFile, and fetches the value(s) from the Distributed Map Cache associated with each key. If configured without a destination attribute, the incoming FlowFile's content is replaced with the binary data received by the Distributed Map Cache. If there is no value stored under that key then the flow file will be routed to 'not-found'. Note that the processor will always attempt to read the entire cached value into memory before placing it in its destination. This could be potentially problematic if the cached value is very large.

FetchElasticsearch

Retrieves a document from Elasticsearch using the specified connection properties and the identifier of the document to retrieve. If the cluster has been configured for authorization and/or secure transport (SSL/TLS) and the Shield plugin is available, secure connections can be made. This processor supports Elasticsearch 2.x clusters.

FetchElasticsearchHttp

Retrieves a document from Elasticsearch using the specified connection properties and the identifier of the document to retrieve. Note that the full body of the document will be read into memory before being written to a Flow File for transfer.

FetchFile

Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.

FetchFTP

Fetches the content of a file from a remote FTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.

FetchGCSObject

Fetches a file from a Google Cloud Bucket. Designed to be used in tandem with ListGCSBucket.

FetchGridFS

Retrieves one or more files from a GridFS bucket by file name or by a user-defined query.

FetchHBaseRow

Fetches a row from an HBase table. The Destination property controls whether the cells are added as flow file attributes, or the row is written to the flow file content as JSON. This processor may be used to fetch a fixed row on an interval by specifying the table and row id directly in the processor, or it may be used to dynamically fetch rows by referencing the table and row id from incoming flow files.

FetchHDFS

Retrieves a file from HDFS. The content of the incoming FlowFile is replaced by the content of the file in HDFS. The file in HDFS is left intact without any changes being made to it.

FetchParquet

Reads from a given Parquet file and writes records to the content of the flow file using the selected record writer. The original Parquet file will remain unchanged, and the content of the flow file will be replaced with records of the selected type. This processor can be used with ListHDFS or ListFile to obtain a listing of files to fetch.

FetchS3Object

Retrieves the contents of an S3 Object and writes it to the content of a FlowFile

FetchSFTP

Fetches the content of a file from a remote SFTP server and overwrites the contents of an incoming FlowFile with the content of the remote file.

FlattenJson

Provides the user with the ability to take a nested JSON document and flatten it into a simple key/value pair document. The keys are combined at each level with a user-defined separator that defaults to '.'. This Processor also allows unflattening a previously flattened JSON document. It supports four flatten modes: normal, keep-arrays, dot notation for MongoDB query, and keep-primitive-arrays. The default flatten mode is 'keep-arrays'.
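A minimal sketch of the 'keep-arrays' style of flattening described above: nested object keys are joined with the '.' separator while arrays are left intact. The input document is invented and this is not the processor's implementation.

```python
def flatten(value, parent_key="", separator="."):
    """Flatten nested objects into dotted keys; arrays are kept as-is."""
    flattened = {}
    if isinstance(value, dict):
        for key, child in value.items():
            new_key = f"{parent_key}{separator}{key}" if parent_key else key
            flattened.update(flatten(child, new_key, separator))
    else:
        flattened[parent_key] = value
    return flattened

nested = {"user": {"name": "Alice", "address": {"city": "Oslo"}}, "tags": ["a", "b"]}
print(flatten(nested))
# {'user.name': 'Alice', 'user.address.city': 'Oslo', 'tags': ['a', 'b']}
```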

ForkRecord

This processor allows the user to fork a record into multiple records. The user must specify at least one Record Path, as a dynamic property, pointing to a field of type ARRAY containing RECORD objects. The processor accepts two modes: ‘split’ and ‘extract’. In both modes, there is one record generated per element contained in the designated array. In the ‘split’ mode, each generated record will preserve the same schema as given in the input but the array will contain only one element. In the ‘extract’ mode, the element of the array must be of record type and will be the generated record. Additionally, in the ‘extract’ mode, it is possible to specify if each generated record should contain all the fields of the parent records from the root level to the extracted record. This assumes that the fields to add in the record are defined in the schema of the Record Writer controller service. See examples in the additional details documentation of this processor.

FuzzyHashContent

Calculates a fuzzy/locality-sensitive hash value for the Content of a FlowFile and puts that hash value on the FlowFile as an attribute whose name is determined by the property. Note: this processor only offers non-cryptographic hash algorithms, and it should not be seen as a replacement for the HashContent processor. Note: the underlying library loads the entirety of the streamed content into memory and performs result evaluations in memory. Accordingly, it is important to consider the anticipated profile of the content being evaluated by this processor and the hardware supporting it, especially when working with large files.

GenerateFlowFile

This processor creates FlowFiles with random data or custom content. GenerateFlowFile is useful for load testing, configuration, and simulation. Also see DuplicateFlowFile for additional load testing.

GenerateTableFetch

Generates SQL select queries that fetch “pages” of rows from a table. The partition size property, along with the table’s row count, determine the size and number of pages and generated FlowFiles. In addition, incremental fetching can be achieved by setting Maximum-Value Columns, which causes the processor to track the columns’ maximum values, thus only fetching rows whose columns’ values exceed the observed maximums. This processor is intended to be run on the Primary Node only.

This processor can accept incoming connections; the behavior of the processor differs depending on whether incoming connections are provided:

  • If no incoming connection(s) are specified, the processor will generate SQL queries on the specified processor schedule. Expression Language is supported for many fields, but no flow file attributes are available. However the properties will be evaluated using the Variable Registry.
  • If incoming connection(s) are specified and no flow file is available to a processor task, no work will be performed.
  • If incoming connection(s) are specified and a flow file is available to a processor task, the flow file’s attributes may be used in Expression Language for such fields as Table Name and others. However, the Max-Value Columns and Columns to Return fields must be empty or refer to columns that are available in each specified table.
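As a rough illustration of the paging arithmetic described above, the sketch below shows how a partition size and a row count determine the number of generated page queries; the table, column, row count, and LIMIT/OFFSET form of the SQL are assumptions, since the real statements depend on the configured database adapter.

```python
import math

table_name = "orders"        # hypothetical source table
max_value_column = "id"      # hypothetical ordering / maximum-value column
row_count = 23456            # hypothetical COUNT(*) result
partition_size = 10000       # Partition Size property

pages = math.ceil(row_count / partition_size)   # 3 pages for this example
for page in range(pages):
    offset = page * partition_size
    print(
        f"SELECT * FROM {table_name} ORDER BY {max_value_column} "
        f"LIMIT {partition_size} OFFSET {offset}"
    )
```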

GeoEnrichIP

Looks up geolocation information for an IP address and adds the geo information to FlowFile attributes. The geo data is provided as a MaxMind database. The attribute that contains the IP address to lookup is provided by the 'IP Address Attribute' property. If the name of the attribute provided is 'X', then the attributes added by enrichment will take the form X.geo.

GeoEnrichIPRecord

Looks up geolocation information for an IP address and adds the geo information to FlowFile attributes. The geo data is provided as a MaxMind database. This version uses the NiFi Record API to allow large scale enrichment of record-oriented data sets. Each field provided by the MaxMind database can be directed to a field of the user’s choosing by providing a record path for that field configuration.

GetAzureEventHub

Receives messages from Microsoft Azure Event Hubs, writing the contents of the Azure message to the content of the FlowFile. Note: Please be aware that this processor creates a thread pool of 4 threads for Event Hub Client. They will be extra threads other than the concurrent tasks scheduled for this processor.

GetAzureQueueStorage

Retrieves the messages from an Azure Queue Storage. The retrieved messages will be deleted from the queue by default. If the requirement is to consume messages without deleting them, set ‘Auto Delete Messages’ to ‘false’. Note: There might be chances of receiving duplicates in situations like when a message is received but was unable to be deleted from the queue due to some unexpected situations.

GetCouchbaseKey

Get a document from Couchbase Server via Key/Value access. The ID of the document to fetch may be supplied by setting the Document Id property. NOTE: if the Document Id property is not set, the contents of the FlowFile will be read to determine the Document Id, which means that the contents of the entire FlowFile will be buffered in memory.

GetDynamoDB

Retrieves a document from DynamoDB based on hash and range key. The key can be string or number. For any get request all the primary keys are required (hash or hash and range based on the table keys). A Json Document ('Map') attribute of the DynamoDB item is read into the content of the FlowFile.

GetFile

Creates FlowFiles from files in a directory. NiFi will ignore files it doesn’t have at least read permissions for.

GetFTP

Fetches files from an FTP Server and creates FlowFiles from them

GetHBase

This Processor polls HBase for any records in the specified table. The processor keeps track of the timestamp of the cells that it receives, so that as new records are pushed to HBase, they will automatically be pulled. Each record is output in JSON format, as {"row": "<row key>", "cells": { "<column 1 family>:<column 1 qualifier>": "<cell 1 value>", "<column 2 family>:<column 2 qualifier>": "<cell 2 value>", ... }}. For each record received, a Provenance RECEIVE event is emitted with the format hbase://<table name>/<row key>, where <row key> is the UTF-8 encoded value of the row's key.

GetHDFS

Fetch files from Hadoop Distributed File System (HDFS) into FlowFiles. This Processor will delete the file from HDFS after fetching it.

GetHDFSEvents

This processor polls the notification events provided by the HdfsAdmin API. Since this uses the HdfsAdmin APIs it is required to run as an HDFS super user. Currently there are six types of events (append, close, create, metadata, rename, and unlink). Please see org.apache.hadoop.hdfs.inotify.Event documentation for full explanations of each event. This processor will poll for new events based on a defined duration. For each event received a new flow file will be created with the expected attributes and the event itself serialized to JSON and written to the flow file's content. For example, if event.type is APPEND then the content of the flow file will contain a JSON file containing the information about the append event. If successful the flow files are sent to the 'success' relationship. Be careful of where the generated flow files are stored. If the flow files are stored in one of the processor's watch directories there will be a never-ending flow of events. It is also important to be aware that this processor must consume all events. The filtering must happen within the processor. This is because the HDFS admin's event notifications API does not have filtering.

GetHDFSFileInfo

Retrieves a listing of files and directories from HDFS. This processor creates a FlowFile(s) that represents the HDFS file/dir with relevant information. The main purpose of this processor is to provide functionality similar to the HDFS client, i.e. count, du, ls, test, etc. Unlike ListHDFS, this processor is stateless, supports incoming connections and provides information on a dir level.

GetHDFSSequenceFile

Fetch sequence files from Hadoop Distributed File System (HDFS) into FlowFiles

GetHTMLElement

Extracts HTML element values from the incoming flowfile's content using a CSS selector. The incoming HTML is first converted into an HTML Document Object Model so that HTML elements may be selected in a similar manner to how CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then "queried" using the user-defined CSS selector string. The result of "querying" the HTML DOM may produce 0-N results. If no results are found, the flowfile will be transferred to the "element not found" relationship to indicate so to the end user. If N results are found, a new flowfile will be created and emitted for each result. The query result will either be placed in the content of the new flowfile or as an attribute of the new flowfile. By default the result is written to an attribute. This can be controlled by the "Destination" property. Resulting query values may also have data prepended or appended to them by setting the value of the "Prepend Element Value" or "Append Element Value" properties. Prepended and appended values are treated as string values and concatenated to the result retrieved from the HTML DOM query operation. A more thorough reference for the CSS selector syntax can be found at "http://jsoup.org/apidocs/org/jsoup/select/Selector.html"

GetHTTP

This processor is deprecated and may be removed in future releases.

GetJMSQueue

This processor is deprecated and may be removed in future releases.

GetJMSTopic

This processor is deprecated and may be removed in future releases.

GetMongo

Creates FlowFiles from documents in MongoDB loaded by a user-specified query.

GetMongoRecord

A record-based version of GetMongo that uses the Record writers to write the MongoDB result set.

GetRethinkDB

Processor to get a JSON document from RethinkDB (https://www.rethinkdb.com/) using the document id. The FlowFile will contain the retrieved document

GetSFTP

Fetches files from an SFTP Server and creates FlowFiles from them

GetSmbFile

Reads files from a Samba network location into FlowFiles. Use this processor instead of a CIFS mount if share access control is important. Configure the Hostname, Share and Directory accordingly: \\[Hostname]\[Share]\[path\to\Directory]

GetSNMP

Retrieves information from SNMP Agent with SNMP Get request and outputs a FlowFile with information in attributes and without any content

GetSolr

Queries Solr and outputs the results as a FlowFile in the format of XML or using a Record Writer

GetSplunk

Retrieves data from Splunk Enterprise.

GetSQS

Fetches messages from an Amazon Simple Queuing Service Queue

GetTCP

Connects over TCP to the provided endpoint(s). Received data will be written as content to the FlowFile

GetTwitter

Pulls status changes from Twitter’s streaming API. In versions starting with 1.9.0, the Consumer Key and Access Token are marked as sensitive according to https://developer.twitter.com/en/docs/basics/authentication/guides/securing-keys-and-tokens

HandleHttpRequest

Starts an HTTP Server and listens for HTTP Requests. For each request, creates a FlowFile and transfers to ‘success’. This Processor is designed to be used in conjunction with the HandleHttpResponse Processor in order to create a Web Service. In case of a multipart request, one FlowFile is generated for each part.

HandleHttpResponse

Sends an HTTP Response to the Requestor that generated a FlowFile. This Processor is designed to be used in conjunction with the HandleHttpRequest in order to create a web service.

HashAttribute

Hashes together the key/value pairs of several flowfile attributes and adds the hash as a new attribute. Optional properties are to be added such that the name of the property is the name of a flowfile attribute to consider and the value of the property is a regular expression that, if matched by the attribute value, will cause that attribute to be used as part of the hash. If the regular expression contains a capturing group, only the value of the capturing group will be used. For a processor which accepts various attributes and generates a cryptographic hash of each, see “CryptographicHashAttribute”.

HashContent

This processor is deprecated and may be removed in future releases.

IdentifyMimeType

Attempts to identify the MIME Type used for a FlowFile. If the MIME Type can be identified, an attribute with the name ‘mime.type’ is added with the value being the MIME Type. If the MIME Type cannot be determined, the value will be set to ‘application/octet-stream’. In addition, the attribute mime.extension will be set if a common file extension for the MIME Type is known. If both Config File and Config Body are not set, the default NiFi MIME Types will be used.

InvokeAWSGatewayApi

Client for AWS Gateway API endpoint

InvokeGRPC

Sends FlowFiles, optionally with content, to a configurable remote gRPC service endpoint. The remote gRPC service must abide by the service IDL defined in NiFi. gRPC isn’t intended to carry large payloads, so this processor should be used only when FlowFile sizes are on the order of megabytes. The default maximum message size is 4MB.

InvokeHTTP

An HTTP client processor which can interact with a configurable HTTP Endpoint. The destination URL and HTTP Method are configurable. FlowFile attributes are converted to HTTP headers and the FlowFile contents are included as the body of the request (if the HTTP Method is PUT, POST or PATCH).

InvokeScriptedProcessor

Experimental - Invokes a script engine for a Processor defined in the given script. The script must define a valid class that implements the Processor interface, and it must set a variable ‘processor’ to an instance of the class. Processor methods such as onTrigger() will be delegated to the scripted Processor instance. Also any Relationships or PropertyDescriptors defined by the scripted processor will be added to the configuration dialog. The scripted processor can implement public void setLogger(ComponentLog logger) to get access to the parent logger, as well as public void onScheduled(ProcessContext context) and public void onStopped(ProcessContext context) methods to be invoked when the parent InvokeScriptedProcessor is scheduled or stopped, respectively. Experimental: Impact of sustained usage not yet verified.

ISPEnrichIP

Looks up ISP information for an IP address and adds the information to FlowFile attributes. The ISP data is provided as a MaxMind ISP database (note that this is NOT the same as the GeoLite database utilized by some geo enrichment tools). The attribute that contains the IP address to lookup is provided by the 'IP Address Attribute' property. If the name of the attribute provided is 'X', then the attributes added by enrichment will take the form X.isp.

JoltTransformJSON

Applies a list of Jolt specifications to the flowfile JSON payload. A new FlowFile is created with transformed content and is routed to the ‘success’ relationship. If the JSON transform fails, the original FlowFile is routed to the ‘failure’ relationship.

JoltTransformRecord

Applies a list of Jolt specifications to the FlowFile payload. A new FlowFile is created with transformed content and is routed to the ‘success’ relationship. If the transform fails, the original FlowFile is routed to the ‘failure’ relationship.

JsonQueryElasticsearch

A processor that allows the user to run a query (with aggregations) written with the Elasticsearch JSON DSL. It does not automatically paginate queries for the user. If an incoming relationship is added to this processor, it will use the flowfile’s content for the query. Care should be taken on the size of the query because the entire response from Elasticsearch will be loaded into memory all at once and converted into the resulting flowfiles.

ListAzureBlobStorage

Lists blobs in an Azure Storage container. Listing details are attached to an empty FlowFile for use with FetchAzureBlobStorage. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListAzureDataLakeStorage

Lists a directory in an Azure Data Lake Storage Gen 2 filesystem

ListDatabaseTables

Generates a set of flow files, each containing attributes corresponding to metadata about a table from a database connection. Once metadata about a table has been fetched, it will not be fetched again until the Refresh Interval (if set) has elapsed, or until state has been manually cleared.

ListenFTP

Starts an FTP server that listens on the specified port and transforms incoming files into FlowFiles. The URI of the service will be ftp://{hostname}:{port}. The default port is 2221.

ListenGRPC

Starts a gRPC server and listens on the given port to transform the incoming messages into FlowFiles. The message format is defined by the standard gRPC protobuf IDL provided by NiFi. gRPC isn’t intended to carry large payloads, so this processor should be used only when FlowFile sizes are on the order of megabytes. The default maximum message size is 4MB.

ListenHTTP

Starts an HTTP Server and listens on a given base path to transform incoming requests into FlowFiles. The default URI of the Service will be http://{hostname}:{port}/contentListener. Only HEAD and POST requests are supported. GET, PUT, and DELETE will result in an error and the HTTP response status code 405. GET is supported on /healthcheck. If the service is available, it returns "200 OK" with the content "OK". The health check functionality can be configured to be accessible via a different port. For details see the documentation of the "Listening Port for health check requests" property.

ListenRELP

Listens for RELP messages being sent to a given port over TCP. Each message will be acknowledged after successfully writing the message to a FlowFile. Each FlowFile will contain data portion of one or more RELP frames. In the case where the RELP frames contain syslog messages, the output of this processor can be sent to a ParseSyslog processor for further processing.

ListenSMTP

This processor implements a lightweight SMTP server on an arbitrary port, allowing NiFi to listen for incoming email. Note that this server does not perform any email validation. If direct exposure to the internet is sought, it may be a better idea to use the combination of NiFi and an industrial-scale MTA (e.g. Postfix). Threading for this processor is managed by the underlying SMTP server used, so the processor need not support more than one thread.

ListenSyslog

Listens for Syslog messages being sent to a given port over TCP or UDP. Incoming messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The format of each message is: (<PRIORITY>)(VERSION )(TIMESTAMP) (HOSTNAME) (BODY), where the version is optional. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd'T'HH:mm:ss.SZ" or "yyyy-MM-dd'T'HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If an incoming message matches one of these patterns, the message will be parsed and the individual pieces will be placed in FlowFile attributes, with the original message in the content of the FlowFile. If an incoming message does not match one of these patterns it will not be parsed and the syslog.valid attribute will be set to false with the original message in the content of the FlowFile. Valid messages will be transferred on the success relationship, and invalid messages will be transferred on the invalid relationship.
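For a feel of the RFC3164 attribute breakdown described above, here is a hedged Python sketch; the regular expression, attribute names, and sample message are simplified illustrations rather than the processor's own parsing logic.

```python
import re

# Simplified RFC3164 pattern: (PRIORITY)(TIMESTAMP) (HOSTNAME) (BODY).
SYSLOG_RFC3164 = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<body>.*)$"
)

message = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick"
match = SYSLOG_RFC3164.match(message)

attributes = {"syslog.valid": "false"}
if match:
    attributes = {f"syslog.{name}": value for name, value in match.groupdict().items()}
    attributes["syslog.valid"] = "true"

print(attributes)
# {'syslog.priority': '34', 'syslog.timestamp': 'Oct 11 22:14:15',
#  'syslog.hostname': 'mymachine', 'syslog.body': "su: 'su root' failed for lonvick",
#  'syslog.valid': 'true'}
```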

ListenTCP

Listens for incoming TCP connections and reads data from each connection using a line separator as the message demarcator. The default behavior is for each message to produce a single FlowFile, however this can be controlled by increasing the Batch Size to a larger value for higher throughput. The Receive Buffer Size must be set as large as the largest messages expected to be received, meaning if every 100kb there is a line separator, then the Receive Buffer Size must be greater than 100kb.

ListenTCPRecord

Listens for incoming TCP connections and reads data from each connection using a configured record reader, and writes the records to a flow file using a configured record writer. The type of record reader selected will determine how clients are expected to send data. For example, when using a Grok reader to read logs, a client can keep an open connection and continuously stream data, but when using a JSON reader, the client cannot send an array of JSON documents and then send another array on the same connection, as the reader would be in a bad state at that point. Records will be read from the connection in blocking mode, and will timeout according to the Read Timeout specified in the processor. If the read times out, or if any other error is encountered when reading, the connection will be closed, and any records read up to that point will be handled according to the configured Read Error Strategy (Discard or Transfer). In cases where clients are keeping a connection open, the concurrent tasks for the processor should be adjusted to match the Max Number of TCP Connections allowed, so that there is a task processing each connection.

ListenUDP

Listens for Datagram Packets on a given port. The default behavior produces a FlowFile per datagram, however for higher throughput the Max Batch Size property may be increased to specify the number of datagrams to batch together in a single FlowFile. This processor can be restricted to listening for datagrams from a specific remote host and port by specifying the Sending Host and Sending Host Port properties, otherwise it will listen for datagrams from all hosts and ports.

ListenUDPRecord

Listens for Datagram Packets on a given port and reads the content of each datagram using the configured Record Reader. Each record will then be written to a flow file using the configured Record Writer. This processor can be restricted to listening for datagrams from a specific remote host and port by specifying the Sending Host and Sending Host Port properties, otherwise it will listen for datagrams from all hosts and ports.

ListenWebSocket

Acts as a WebSocket server endpoint to accept client connections. FlowFiles are transferred to downstream relationships according to received message types as the WebSocket server configured with this processor receives client requests

ListFile

Retrieves a listing of files from the local filesystem. For each file that is listed, creates a FlowFile that represents the file so that it can be fetched in conjunction with FetchFile. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetFile, this Processor does not delete any data from the local filesystem.

ListFTP

Performs a listing of the files residing on an FTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchFTP in order to fetch those files.

ListGCSBucket

Retrieves a listing of objects from a GCS bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchGCSObject. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListHDFS

Retrieves a listing of files from HDFS. Each time a listing is performed, the files with the latest timestamp will be excluded and picked up during the next execution of the processor. This is done to ensure that we do not miss any files, or produce duplicates, in the cases where files with the same timestamp are written immediately before and after a single execution of the processor. For each file that is listed in HDFS, this processor creates a FlowFile that represents the HDFS file to be fetched in conjunction with FetchHDFS. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.

ListS3

Retrieves a listing of objects from an S3 bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchS3Object. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

ListSFTP

Performs a listing of the files residing on an SFTP server. For each file that is found on the remote server, a new FlowFile will be created with the filename attribute set to the name of the file on the remote server. This can then be used in conjunction with FetchSFTP in order to fetch those files.

LogAttribute

Emits attributes of the FlowFile at the specified log level

LogMessage

Emits a log message at the specified log level

LookupAttribute

Lookup attributes from a lookup service

LookupRecord

Extracts one or more fields from a Record and looks up a value for those fields in a LookupService. If a result is returned by the LookupService, that result is optionally added to the Record. In this case, the processor functions as an Enrichment processor. Regardless, the Record is then routed to either the ‘matched’ relationship or ‘unmatched’ relationship (if the ‘Routing Strategy’ property is configured to do so), indicating whether or not a result was returned by the LookupService, allowing the processor to also function as a Routing processor. The “coordinates” to use for looking up a value in the Lookup Service are defined by adding a user-defined property. Each property that is added will have an entry added to a Map, where the name of the property becomes the Map Key and the value returned by the RecordPath becomes the value for that key. If multiple values are returned by the RecordPath, then the Record will be routed to the ‘unmatched’ relationship (or ‘success’, depending on the ‘Routing Strategy’ property’s configuration). If one or more fields match the Result RecordPath, all fields that match will be updated. If there is no match in the configured LookupService, then no fields will be updated. I.e., it will not overwrite an existing value in the Record with a null value. Please note, however, that if the results returned by the LookupService are not accounted for in your schema (specifically, the schema that is configured for your Record Writer) then the fields will not be written out to the FlowFile.

MergeContent

Merges a Group of FlowFiles together based on a user-defined strategy and packages them into a single FlowFile. It is recommended that the Processor be configured with only a single incoming connection, as Groups of FlowFiles will not be created from FlowFiles in different connections. This processor updates the mime.type attribute as appropriate.

MergeRecord

This Processor merges together multiple record-oriented FlowFiles into a single FlowFile that contains all of the Records of the input FlowFiles. This Processor works by creating ‘bins’ and then adding FlowFiles to these bins until they are full. Once a bin is full, all of the FlowFiles will be combined into a single output FlowFile, and that FlowFile will be routed to the ‘merged’ Relationship. A bin will consist of potentially many ‘like FlowFiles’. In order for two FlowFiles to be considered ‘like FlowFiles’, they must have the same Schema (as identified by the Record Reader) and, if the property is set, the same value for the specified attribute. See Processor Usage and Additional Details for more information.

ModifyBytes

Discards a byte range at the start and/or end of a binary file’s content, or all of the content.

ModifyHTMLElement

Modifies the value of an existing HTML element. The desired element to be modified is located by using CSS selector syntax. The incoming HTML is first converted into an HTML Document Object Model so that HTML elements may be selected in a manner similar to how CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then “queried” using the user-defined CSS selector string to find the element the user desires to modify. If the HTML element is found, the element’s value is updated in the DOM using the value specified in the “Modified Value” property. All DOM elements that match the CSS selector will be updated. Once all of the DOM elements have been updated, the DOM is rendered to HTML and the result replaces the flowfile content with the updated HTML. A more thorough reference for the CSS selector syntax can be found at “http://jsoup.org/apidocs/org/jsoup/select/Selector.html”

MonitorActivity

Monitors the flow for activity and sends out an indicator when the flow has not had any data for some specified amount of time and again when the flow’s activity is restored

MoveHDFS

Rename existing files or a directory of files (non-recursive) on Hadoop Distributed File System (HDFS).

Notify

Caches a release signal identifier in the distributed cache, optionally along with the FlowFile’s attributes. Any flow files held at a corresponding Wait processor will be released once this signal in the cache is discovered.

ParseCEF

Parses the contents of a CEF formatted message and adds attributes to the FlowFile for headers and extensions of the parts of the CEF message. Note: This Processor expects CEF messages WITHOUT the syslog headers (i.e. starting at “CEF:0”).

ParseEvtx

Parses the contents of a Windows Event Log file (evtx) and writes the resulting XML to the FlowFile

ParseNetflowv5

Parses Netflow v5 byte input and adds the fields to the NiFi FlowFile as attributes or JSON content.

ParseSyslog

Attempts to parse the contents of a Syslog message in accordance with the RFC5424 and RFC3164 formats and adds attributes to the FlowFile for each of the parts of the Syslog message. Note: Be mindful that RFC3164 is informational and a wide range of different implementations are present in the wild. If messages fail parsing, consider using RFC5424 or a generic parsing processor such as ExtractGrok.

ParseSyslog5424

Attempts to parse the contents of a well-formed Syslog message in accordance with the RFC5424 format and adds attributes to the FlowFile for each of the parts of the Syslog message, including Structured Data. Structured Data will be written to attributes as one attribute per item id and parameter (see https://tools.ietf.org/html/rfc5424). Note: ParseSyslog5424 follows the specification more closely than ParseSyslog. If your Syslog producer does not follow the spec closely, with regards to using ‘-’ for missing header entries for example, those logs will fail with this parser, where they would not fail with ParseSyslog.

PartitionRecord

Receives Record-oriented data (i.e., data that can be read by the configured Record Reader) and evaluates one or more RecordPaths against each record in the incoming FlowFile. Each record is then grouped with other “like records” and a FlowFile is created for each group of “like records.” What it means for two records to be “like records” is determined by user-defined properties. The user is required to enter at least one user-defined property whose value is a RecordPath. Two records are considered alike if they have the same value for all configured RecordPaths. Because we know that all records in a given output FlowFile have the same value for the fields that are specified by the RecordPath, an attribute is added for each field. See Additional Details on the Usage page for more information and examples.
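
Conceptually, the grouping works like the sketch below; the records, the field name, and the resulting attribute are hypothetical, and the real processor evaluates a RecordPath (e.g. /region) rather than a plain dictionary key:

```python
# Conceptual sketch of partitioning records by the value of one field.
from collections import defaultdict

records = [
    {"name": "a", "region": "us-east"},
    {"name": "b", "region": "eu-west"},
    {"name": "c", "region": "us-east"},
]

partitions = defaultdict(list)
for record in records:
    partitions[record["region"]].append(record)

# Each group would become its own FlowFile, with a "region" attribute added.
for region, group in partitions.items():
    print(region, group)
```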

PostHTTP

This processor is deprecated and may be removed in future releases.

PostSlack

Sends a message on Slack. The FlowFile content (e.g. an image) can be uploaded and attached to the message.

PublishAMQP

Creates an AMQP Message from the contents of a FlowFile and sends the message to an AMQP Exchange. In a typical AMQP exchange model, the message that is sent to the AMQP Exchange will be routed based on the ‘Routing Key’ to its final destination in the queue (the binding). If due to some misconfiguration the binding between the Exchange, Routing Key and Queue is not set up, the message will have no final destination and will return (i.e., the data will not make it to the queue). If that happens you will see a message to that effect in both the app log and a bulletin, and the FlowFile will be routed to the ‘failure’ relationship.

PublishGCPubSub

Publishes the content of the incoming flowfile to the configured Google Cloud PubSub topic. The processor supports dynamic properties. If any dynamic properties are present, they will be sent along with the message in the form of ‘attributes’.

PublishJMS

Creates a JMS Message from the contents of a FlowFile and sends it to a JMS Destination (queue or topic) as JMS BytesMessage or TextMessage. FlowFile attributes will be added as JMS headers and/or properties to the outgoing JMS message.

PublishKafka_1_0

Sends the contents of a FlowFile as a message to Apache Kafka using the Kafka 1.0 Producer API. The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. The complementary NiFi processor for fetching messages is ConsumeKafka_1_0.

PublishKafka_2_0

Sends the contents of a FlowFile as a message to Apache Kafka using the Kafka 2.0 Producer API. The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. The complementary NiFi processor for fetching messages is ConsumeKafka_2_0.

PublishKafka_2_6

Sends the contents of a FlowFile as a message to Apache Kafka using the Kafka 2.6 Producer API. The messages to send may be individual FlowFiles or may be delimited, using a user-specified delimiter, such as a new-line. The complementary NiFi processor for fetching messages is ConsumeKafka_2_6.

PublishKafkaRecord_1_0

Sends the contents of a FlowFile as individual records to Apache Kafka using the Kafka 1.0 Producer API. The contents of the FlowFile are expected to be record-oriented data that can be read by the configured Record Reader. The complementary NiFi processor for fetching messages is ConsumeKafkaRecord_1_0.

PublishKafkaRecord_2_0

Sends the contents of a FlowFile as individual records to Apache Kafka using the Kafka 2.0 Producer API. The contents of the FlowFile are expected to be record-oriented data that can be read by the configured Record Reader. The complementary NiFi processor for fetching messages is ConsumeKafkaRecord_2_0.

PublishKafkaRecord_2_6

Sends the contents of a FlowFile as individual records to Apache Kafka using the Kafka 2.6 Producer API. The contents of the FlowFile are expected to be record-oriented data that can be read by the configured Record Reader. The complementary NiFi processor for fetching messages is ConsumeKafkaRecord_2_6.

PublishMQTT

Publishes a message to an MQTT topic

PutAzureBlobStorage

Puts content into an Azure Storage Blob

PutAzureCosmosDBRecord

This processor is a record-aware processor for inserting data into Cosmos DB with Core SQL API. It uses a configured record reader and schema to read an incoming record set from the body of a Flowfile and then inserts those records into a configured Cosmos DB Container.

PutAzureDataLakeStorage

Puts content into an Azure Data Lake Storage Gen 2

PutAzureEventHub

Sends the contents of a FlowFile to Windows Azure Event Hubs. Note: the content of the FlowFile will be buffered into memory before being sent, so care should be taken to avoid sending FlowFiles to this Processor that exceed the amount of Java Heap Space available. Also please be aware that this processor creates a thread pool of 4 threads for the Event Hub Client; these are in addition to the concurrent tasks scheduled for this processor.

PutAzureQueueStorage

Writes the content of the incoming FlowFiles to the configured Azure Queue Storage.

PutBigQueryBatch

Batch loads flow files content to a Google BigQuery table.

PutBigQueryStreaming

Load data into Google BigQuery table using the streaming API. This processor is not intended to load large flow files as it will load the full content into memory. If you need to insert large flow files, consider using PutBigQueryBatch instead.

PutCassandraQL

Execute provided Cassandra Query Language (CQL) statement on a Cassandra 1.x, 2.x, or 3.0.x cluster. The content of an incoming FlowFile is expected to be the CQL command to execute. The CQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention cql.args.N.type and cql.args.N.value, where N is a positive integer. The cql.args.N.type is expected to be a lowercase string indicating the Cassandra type.

PutCassandraRecord

This is a record aware processor that reads the content of the incoming FlowFile as individual records using the configured ‘Record Reader’ and writes them to Apache Cassandra using native protocol version 3 or higher.

PutCloudWatchMetric

Publishes metrics to Amazon CloudWatch. Metric can be either a single value, or a StatisticSet comprised of minimum, maximum, sum and sample count.

PutCouchbaseKey

Put a document to Couchbase Server via Key/Value access.

PutDatabaseRecord

The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single transaction. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as ‘Use statement.type Attribute’, which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).

PutDistributedMapCache

Gets the content of a FlowFile and puts it to a distributed map cache, using a cache key computed from FlowFile attributes. If the cache already contains the entry and the cache update strategy is ‘keep original’, the entry is not replaced.

PutDynamoDB

Puts a document into DynamoDB based on hash and range key. The table can have either hash and range or hash key alone. Currently the keys supported are string and number and the value can be a JSON document. In the case of hash and range keys, both keys are required for the operation. The FlowFile content must be JSON. FlowFile content is mapped to the specified Json Document attribute in the DynamoDB item.

PutElasticsearch

Writes the contents of a FlowFile to Elasticsearch, using the specified parameters such as the index to insert into and the type of the document. If the cluster has been configured for authorization and/or secure transport (SSL/TLS) and the Shield plugin is available, secure connections can be made. This processor supports Elasticsearch 2.x clusters.

PutElasticsearchHttp

Writes the contents of a FlowFile to Elasticsearch, using the specified parameters such as the index to insert into and the type of the document.

PutElasticsearchHttpRecord

Writes the records from a FlowFile into Elasticsearch, using the specified parameters such as the index to insert into and the type of the document, as well as the operation type (index, upsert, delete, etc.). Note: The Bulk API is used to send the records. This means that the entire contents of the incoming flow file are read into memory, and each record is transformed into a JSON document which is added to a single HTTP request body. For very large flow files (e.g., files with a large number of records), this could cause memory usage issues.

PutElasticsearchRecord

A record-aware Elasticsearch put processor that uses the official Elastic REST client libraries.

PutEmail

Sends an e-mail to configured recipients for each incoming FlowFile

PutFile

Writes the contents of a FlowFile to the local file system

PutFTP

Sends FlowFiles to an FTP Server

PutGCSObject

Puts flow files to a Google Cloud Bucket.

PutGridFS

Writes a file to a GridFS bucket.

PutHBaseCell

Adds the Contents of a FlowFile to HBase as the value of a single cell

PutHBaseJSON

Adds rows to HBase based on the contents of incoming JSON documents. Each FlowFile must contain a single UTF-8 encoded JSON document, and any FlowFiles where the root element is not a single document will be routed to failure. Each JSON field name and value will become a column qualifier and value of the HBase row. Any fields with a null value will be skipped, and fields with a complex value will be handled according to the Complex Field Strategy. The row id can be specified either directly on the processor through the Row Identifier property, or can be extracted from the JSON document by specifying the Row Identifier Field Name property. This processor will hold the contents of all FlowFiles for the given batch in memory at one time.
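
As a rough sketch of the mapping described above (the document, the field names, and the use of “id” as the Row Identifier Field Name are all hypothetical):

```python
# How a flat JSON document maps to an HBase row: one field supplies the row id,
# the remaining non-null fields become column qualifier -> value pairs.
import json

doc = json.loads('{"id": "row-1", "temperature": 21.5, "unit": "C", "note": null}')

row_id = doc.pop("id")                                   # via Row Identifier Field Name
cells = {k: v for k, v in doc.items() if v is not None}  # null fields are skipped

print(row_id, cells)
```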

PutHBaseRecord

Adds rows to HBase based on the contents of a flowfile using a configured record reader.

PutHDFS

Write FlowFile data to Hadoop Distributed File System (HDFS)

PutHiveQL

Executes a HiveQL DDL/DML command (e.g., UPDATE, INSERT). The content of an incoming FlowFile is expected to be the HiveQL command to execute. The HiveQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention hiveql.args.N.type and hiveql.args.N.value, where N is a positive integer. The hiveql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.

PutHiveStreaming

This processor uses Hive Streaming to send flow file data to an Apache Hive table. The incoming flow file is expected to be in Avro format and the table must exist in Hive. Please see the Hive documentation for requirements on the Hive table (format, partitions, etc.). The partition values are extracted from the Avro record based on the names of the partition columns as specified in the processor. NOTE: If multiple concurrent tasks are configured for this processor, only one table can be written to at any time by a single thread. Additional tasks intending to write to the same table will wait for the current task to finish writing to the table.

PutHTMLElement

Places a new HTML element in the existing HTML DOM. The desired position for the new HTML element is specified by using CSS selector syntax. The incoming HTML is first converted into an HTML Document Object Model so that the desired DOM location may be found in a manner similar to how CSS selectors are used to apply styles to HTML. The resulting HTML DOM is then “queried” using the user-defined CSS selector string to find the position where the user desires to add the new HTML element. Once the new HTML element is added to the DOM, it is rendered to HTML and the result replaces the flowfile content with the updated HTML. A more thorough reference for the CSS selector syntax can be found at “http://jsoup.org/apidocs/org/jsoup/select/Selector.html”

PutInfluxDB

Writes the content of a FlowFile to InfluxDB in ‘line protocol’ format. Please check the details of the ‘line protocol’ in the InfluxDB documentation (https://www.influxdb.com/). The flow file can contain a single measurement point or multiple measurement points separated by a line separator. The timestamp (last field) should be in nanosecond resolution.

PutJMS

This processor is deprecated and may be removed in future releases.

PutKinesisFirehose

Sends the contents to a specified Amazon Kinesis Firehose. In order to send data to firehose, the firehose delivery stream name has to be specified.

PutKinesisStream

Sends the contents to a specified Amazon Kinesis. In order to send data to Kinesis, the stream name has to be specified.

PutKudu

Reads records from an incoming FlowFile using the provided Record Reader, and writes those records to the specified Kudu table. The schema for the Kudu table is inferred from the schema of the Record Reader. If any error occurs while reading records from the input, or writing records to Kudu, the FlowFile will be routed to failure.

PutLambda

Sends the contents to a specified Amazon Lambda Function. The AWS credentials used for authentication must have permission to execute the Lambda function (lambda:InvokeFunction). The FlowFile content must be JSON.

PutMongo

Writes the contents of a FlowFile to MongoDB

PutMongoRecord

This processor is a record-aware processor for inserting data into MongoDB. It uses a configured record reader and schema to read an incoming record set from the body of a flowfile and then inserts batches of those records into a configured MongoDB collection. This processor does not support updates, deletes or upserts. The number of documents to insert at a time is controlled by the “Insert Batch Size” configuration property. This value should be set to a reasonable size to ensure that MongoDB is not overloaded with too many inserts at once.

PutParquet

Reads records from an incoming FlowFile using the provided Record Reader, and writes those records to a Parquet file. The schema for the Parquet file must be provided in the processor properties. This processor will first write a temporary dot file and upon successfully writing every record to the dot file, it will rename the dot file to its final name. If the dot file cannot be renamed, the rename operation will be attempted up to 10 times, and if still not successful, the dot file will be deleted and the flow file will be routed to failure. If any error occurs while reading records from the input, or writing records to the output, the entire dot file will be removed and the flow file will be routed to failure or retry, depending on the error.

PutRecord

The PutRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file, and sends them to a destination specified by a Record Destination Service (i.e. record sink).

PutRethinkDB

Writes the JSON content of a FlowFile to RethinkDB (https://www.rethinkdb.com/). The flow file should contain either a JSON object or an array of JSON documents.

PutRiemann

Send events to Riemann (http://riemann.io) when FlowFiles pass through this processor. You can use events to notify Riemann that a FlowFile passed through, or you can attach a more meaningful metric, such as the time a FlowFile took to get to this processor. All attributes attached to events support the NiFi Expression Language.

PutS3Object

Puts FlowFiles to an Amazon S3 Bucket. The upload uses either the PutS3Object method or the PutS3MultipartUpload method. The PutS3Object method sends the file in a single synchronous call, but it has a 5GB size limit. Larger files are sent using the PutS3MultipartUpload method. This multipart process saves state after each step so that a large upload can be resumed with minimal loss if the processor or cluster is stopped and restarted. A multipart upload consists of three steps: 1) initiate upload, 2) upload the parts, and 3) complete the upload. For multipart uploads, the processor saves state locally tracking the upload ID and parts uploaded, which must both be provided to complete the upload. The AWS libraries select an endpoint URL based on the AWS region, but this can be overridden with the ‘Endpoint Override URL’ property for use with other S3-compatible endpoints. The S3 API specifies that the maximum file size for a PutS3Object upload is 5GB. It also requires that parts in a multipart upload must be at least 5MB in size, except for the last part. These limits establish the bounds for the Multipart Upload Threshold and Part Size properties.
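
A back-of-the-envelope sketch of those bounds, using an assumed 12 GB file and 100 MB Part Size:

```python
# Files over 5 GB must use multipart upload; parts (except the last) must be
# at least 5 MB. The 12 GB file and 100 MB part size are example values.
import math

GB, MB = 1024**3, 1024**2
file_size = 12 * GB
part_size = 100 * MB                     # must be >= 5 MB

use_multipart = file_size > 5 * GB
parts = math.ceil(file_size / part_size) if use_multipart else 1
print(use_multipart, parts)              # True, 123 parts
```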

PutSFTP

Sends FlowFiles to an SFTP Server

PutSlack

Sends a message to your team on slack.com

PutSmbFile

Writes the contents of a FlowFile to a Samba network location. Use this processor instead of a CIFS mount if share access control is important. Configure the Hostname, Share and Directory accordingly: \\[Hostname]\[Share]\[path\to\Directory]

PutSNS

Sends the content of a FlowFile as a notification to the Amazon Simple Notification Service

PutSolrContentStream

Sends the contents of a FlowFile as a ContentStream to Solr

PutSolrRecord

Indexes the Records from a FlowFile into Solr

PutSplunk

Sends logs to Splunk Enterprise over TCP, TCP + TLS/SSL, or UDP. If a Message Delimiter is provided, then this processor will read messages from the incoming FlowFile based on the delimiter, and send each message to Splunk. If a Message Delimiter is not provided then the content of the FlowFile will be sent directly to Splunk as if it were a single message.

PutSplunkHTTP

Sends flow file content to the specified Splunk server over HTTP or HTTPS. Supports HEC Index Acknowledgement.

PutSQL

Executes a SQL UPDATE or INSERT command. The content of an incoming FlowFile is expected to be the SQL command to execute. The SQL command may use the ? to escape parameters. In this case, the parameters to use must exist as FlowFile attributes with the naming convention sql.args.N.type and sql.args.N.value, where N is a positive integer. The sql.args.N.type is expected to be a number indicating the JDBC Type. The content of the FlowFile is expected to be in UTF-8 format.
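
A minimal sketch of the attribute convention, assuming a FlowFile whose content is "INSERT INTO users (id, name) VALUES (?, ?)" (the table and values are hypothetical; 4 and 12 are the JDBC type codes for INTEGER and VARCHAR):

```python
# FlowFile attributes that would supply the two ? parameters for PutSQL.
flowfile_attributes = {
    "sql.args.1.type": "4",       # java.sql.Types.INTEGER
    "sql.args.1.value": "42",
    "sql.args.2.type": "12",      # java.sql.Types.VARCHAR
    "sql.args.2.value": "alice",
}
print(flowfile_attributes)
```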

PutSQS

Publishes a message to an Amazon Simple Queuing Service Queue

PutSyslog

Sends Syslog messages to a given host and port over TCP or UDP. Messages are constructed from the “Message ___” properties of the processor which can use expression language to generate messages from incoming FlowFiles. The properties are used to construct messages of the form: (<PRIORITY>)(VERSION )(TIMESTAMP) (HOSTNAME) (BODY) where version is optional. The constructed messages are checked against regular expressions for RFC5424 and RFC3164 formatted messages. The timestamp can be an RFC5424 timestamp with a format of "yyyy-MM-dd'T'HH:mm:ss.SZ" or "yyyy-MM-dd'T'HH:mm:ss.S+hh:mm", or it can be an RFC3164 timestamp with a format of "MMM d HH:mm:ss". If a message is constructed that does not form a valid Syslog message according to the above description, then it is routed to the invalid relationship. Valid messages are sent to the Syslog server; successes are routed to the success relationship, and failures are routed to the failure relationship.

PutTCP

The PutTCP processor receives a FlowFile and transmits the FlowFile content over a TCP connection to the configured TCP server. By default, the FlowFiles are transmitted over the same TCP connection (or pool of TCP connections if multiple input threads are configured). To assist the TCP server with determining message boundaries, an optional “Outgoing Message Delimiter” string can be configured which is appended to the end of each FlowFile’s content when it is transmitted over the TCP connection. An optional “Connection Per FlowFile” parameter can be specified to change the behavior so that each FlowFile’s content is transmitted over a single TCP connection which is opened when the FlowFile is received and closed after the FlowFile has been sent. This option should only be used for low message volume scenarios, otherwise the platform may run out of TCP sockets.

PutUDP

The PutUDP processor receives a FlowFile and packages the FlowFile content into a single UDP datagram packet which is then transmitted to the configured UDP server. The user must ensure that the FlowFile content being fed to this processor is not larger than the maximum size for the underlying UDP transport. The maximum transport size will vary based on the platform setup but is generally just under 64KB. FlowFiles will be marked as failed if their content is larger than the maximum transport size.
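
As a quick sanity check of that size constraint (65,507 bytes is the theoretical maximum UDP payload over IPv4; the 70,000-byte payload is just an example):

```python
# A FlowFile larger than the maximum UDP payload would be routed to failure.
MAX_UDP_PAYLOAD = 65507                  # bytes, IPv4 theoretical maximum

content = b"x" * 70_000
print(len(content) <= MAX_UDP_PAYLOAD)   # False -> too large to send as one datagram
```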

PutWebSocket

Sends messages to a WebSocket remote endpoint using a WebSocket session that is established by either ListenWebSocket or ConnectWebSocket.

QueryCassandra

Execute provided Cassandra Query Language (CQL) select query on a Cassandra 1.x, 2.x, or 3.0.x cluster. Query result may be converted to Avro or JSON format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute ‘executecql.row.count’ indicates how many rows were selected.

QueryDatabaseTable

Generates a SQL select query, or uses a provided statement, and executes it to fetch all rows whose values in the specified Maximum Value column(s) are larger than the previously-seen maxima. Query result will be converted to Avro format. Expression Language is supported for several properties, but no incoming connections are permitted. The Variable Registry may be used to provide values for any property containing Expression Language. If it is desired to leverage flow file attributes to perform these queries, the GenerateTableFetch and/or ExecuteSQL processors can be used for this purpose. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods. This processor is intended to be run on the Primary Node only. FlowFile attribute ‘querydbtable.row.count’ indicates how many rows were selected.

QueryDatabaseTableRecord

Generates a SQL select query, or uses a provided statement, and executes it to fetch all rows whose values in the specified Maximum Value column(s) are larger than the previously-seen maxima. Query result will be converted to the format specified by the record writer. Expression Language is supported for several properties, but no incoming connections are permitted. The Variable Registry may be used to provide values for any property containing Expression Language. If it is desired to leverage flow file attributes to perform these queries, the GenerateTableFetch and/or ExecuteSQL processors can be used for this purpose. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer or cron expression, using the standard scheduling methods. This processor is intended to be run on the Primary Node only. FlowFile attribute ‘querydbtable.row.count’ indicates how many rows were selected.

QueryDNS

A powerful DNS query processor primarily designed to enrich DataFlows with DNS-based APIs (e.g. RBLs, ShadowServer’s ASN lookup), but it can also be used to perform regular DNS lookups.

QueryElasticsearchHttp

Queries Elasticsearch using the specified connection properties. Note that the full body of each page of documents will be read into memory before being written to Flow Files for transfer. Also note that the Elasticsearch max_result_window index setting is the upper bound on the number of records that can be retrieved using this query. To retrieve more records, use the ScrollElasticsearchHttp processor.

QueryRecord

Evaluates one or more SQL queries against the contents of a FlowFile. The result of the SQL query then becomes the content of the output FlowFile. This can be used, for example, for field-specific filtering, transformation, and row-level filtering. Columns can be renamed, simple calculations and aggregations performed, etc. The Processor is configured with a Record Reader Controller Service and a Record Writer service so as to allow flexibility in incoming and outgoing data formats. The Processor must be configured with at least one user-defined property. The name of the Property is the Relationship to route data to, and the value of the Property is a SQL SELECT statement that is used to specify how input data should be transformed/filtered. The SQL statement must be valid ANSI SQL and is powered by Apache Calcite. If the transformation fails, the original FlowFile is routed to the ‘failure’ relationship. Otherwise, the data selected will be routed to the associated relationship. If the Record Writer chooses to inherit the schema from the Record, it is important to note that the schema that is inherited will be from the ResultSet, rather than the input Record. This allows a single instance of the QueryRecord processor to have multiple queries, each of which returns a different set of columns and aggregations. As a result, though, the schema that is derived will have no schema name, so it is important that the configured Record Writer not attempt to write the Schema Name as an attribute if inheriting the Schema from the Record. See the Processor Usage documentation for more information.
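
A sketch of the kind of dynamic property you might add: the property name becomes a relationship, and the value is the SQL, which in QueryRecord typically selects from a table named FLOWFILE representing the incoming records (the relationship and field names here are hypothetical):

```python
# One dynamic property: relationship name -> Calcite SQL over the FlowFile's records.
query_properties = {
    "over_threshold": "SELECT id, temperature FROM FLOWFILE WHERE temperature > 30",
}
print(query_properties)
```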

QuerySolr

Queries Solr and outputs the results as a FlowFile in the format of XML or using a Record Writer

QuerySplunkIndexingStatus

Queries Splunk server in order to acquire the status of indexing acknowledgement.

QueryWhois

A powerful whois query processor primarily designed to enrich DataFlows with whois-based APIs (e.g. ShadowServer’s ASN lookup), but it can also be used to perform regular whois lookups.

ReplaceText

Updates the content of a FlowFile by evaluating a Regular Expression (regex) against it and replacing the section of the content that matches the Regular Expression with some alternate value.

ReplaceTextWithMapping

Updates the content of a FlowFile by evaluating a Regular Expression against it and replacing the section of the content that matches the Regular Expression with some alternate value provided in a mapping file.

RetryFlowFile

FlowFiles passed to this Processor have a ‘Retry Attribute’ value checked against a configured ‘Maximum Retries’ value. If the current attribute value is below the configured maximum, the FlowFile is passed to a retry relationship. The FlowFile may or may not be penalized in that condition. If the FlowFile’s attribute value exceeds the configured maximum, the FlowFile will be passed to a ‘retries_exceeded’ relationship. WARNING: If the incoming FlowFile has a non-numeric value in the configured ‘Retry Attribute’ attribute, it will be reset to ‘1’. You may choose to fail the FlowFile instead of performing the reset. Additional dynamic properties can be defined for any attributes you wish to add to the FlowFiles transferred to ‘retries_exceeded’. These attributes support attribute expression language.

RouteHL7

Routes incoming HL7 data according to user-defined queries. To add a query, add a new property to the processor. The name of the property will become a new relationship for the processor, and the value is an HL7 Query Language query. If a FlowFile matches the query, a copy of the FlowFile will be routed to the associated relationship.

RouteOnAttribute

Routes FlowFiles based on their Attributes using the Attribute Expression Language

RouteOnContent

Applies Regular Expressions to the content of a FlowFile and routes a copy of the FlowFile to each destination whose Regular Expression matches. Regular Expressions are added as User-Defined Properties where the name of the property is the name of the relationship and the value is a Regular Expression to match against the FlowFile content. User-Defined properties do support the Attribute Expression Language, but the results are interpreted as literal values, not Regular Expressions

RouteText

Routes textual data based on a set of user-defined rules. Each line in an incoming FlowFile is compared against the values specified by user-defined Properties. The mechanism by which the text is compared to these user-defined properties is defined by the ‘Matching Strategy’. The data is then routed according to these rules, routing each line of the text individually.

RunMongoAggregation

A processor that runs an aggregation query whenever a flowfile is received.

SampleRecord

Samples the records of a FlowFile based on a specified sampling strategy (such as Reservoir Sampling). The resulting FlowFile may be of a fixed number of records (in the case of reservoir-based algorithms) or some subset of the total number of records (in the case of probabilistic sampling), or a deterministic number of records (in the case of interval sampling).

ScanAttribute

Scans the specified attributes of FlowFiles, checking to see if any of their values are present within the specified dictionary of terms

ScanContent

Scans the content of FlowFiles for terms that are found in a user-supplied dictionary. If a term is matched, the UTF-8 encoded version of the term will be added to the FlowFile using the ‘matching.term’ attribute

ScanHBase

Scans and fetches rows from an HBase table. This processor may be used to fetch rows from an HBase table by specifying a range of rowkey values (start and/or end), by time range, by filter expression, or any combination of them. The order of records can be controlled by the Reversed property, and the number of rows retrieved by the processor can be limited.

ScriptedTransformRecord

Provides the ability to evaluate a simple script against each record in an incoming FlowFile. The script may transform the record in some way, filter the record, or fork additional records. See Processor’s Additional Details for more information.

ScrollElasticsearchHttp

Scrolls through an Elasticsearch query using the specified connection properties. This processor is intended to be run on the primary node, and is designed for scrolling through huge result sets, as in the case of a reindex. The state must be cleared before another query can be run. Each page of results is returned, wrapped in a JSON object like so: { “hits” : [ <doc1>, <doc2>, <docn> ] }. Note that the full body of each page of documents will be read into memory before being written to a Flow File for transfer.

SegmentContent

Segments a FlowFile into multiple smaller segments on byte boundaries. Each segment is given the following attributes: fragment.identifier, fragment.index, fragment.count, segment.original.filename; these attributes can then be used by the MergeContent processor in order to reconstitute the original FlowFile

SelectHiveQL

Execute provided HiveQL SELECT query against a Hive database connection. Query result will be converted to Avro or CSV format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute ‘selecthiveql.row.count’ indicates how many rows were selected.

SetSNMP

Based on incoming FlowFile attributes, the processor will execute SNMP Set requests. When finding attributes with the name snmp$<OID>, the processor will attempt to set the value of the attribute to the corresponding OID given in the attribute name.

SplitAvro

Splits a binary encoded Avro datafile into smaller files based on the configured Output Size. The Output Strategy determines if the smaller files will be Avro datafiles, or bare Avro records with metadata in the FlowFile attributes. The output will always be binary encoded.

SplitContent

Splits incoming FlowFiles by a specified byte sequence

SplitJson

Splits a JSON File into multiple, separate FlowFiles for an array element specified by a JsonPath expression. Each generated FlowFile is comprised of an element of the specified array and transferred to relationship ‘split,’ with the original file transferred to the ‘original’ relationship. If the specified JsonPath is not found or does not evaluate to an array element, the original file is routed to ‘failure’ and no files are generated.
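
Conceptually, the split looks like the sketch below; the document and the JsonPath expression $.records[*] are assumptions:

```python
# Each element of the array selected by the JsonPath becomes its own FlowFile.
import json

doc = '{"records": [{"id": 1}, {"id": 2}, {"id": 3}]}'
elements = json.loads(doc)["records"]        # what $.records[*] would select
splits = [json.dumps(e) for e in elements]   # one split FlowFile per element
print(splits)
```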

SplitRecord

Splits up an input FlowFile that is in a record-oriented data format into multiple smaller FlowFiles

SplitText

Splits a text file into multiple smaller text files on line boundaries limited by maximum number of lines or total size of fragment. Each output split file will contain no more than the configured number of lines or bytes. If both Line Split Count and Maximum Fragment Size are specified, the split occurs at whichever limit is reached first. If the first line of a fragment exceeds the Maximum Fragment Size, that line will be output in a single split file which exceeds the configured maximum size limit. This component also allows one to specify that each split should include header lines. Header lines can be computed either by specifying the number of lines that should constitute a header or by using a header marker to match against the read lines. If such a match happens, the corresponding line will be treated as a header. Keep in mind that upon the first failure of header marker match, no more matches will be performed and the rest of the data will be parsed as regular lines for a given split. If, after computation of the header, there is no more data, the resulting split will consist of only header lines.
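
A conceptual sketch of line-based splitting with a repeated header, assuming a Header Line Count of 1 and a Line Split Count of 2 (the sample text is hypothetical):

```python
# Split a text into chunks of at most 2 data lines, repeating 1 header line.
text = "ts,value\n1,a\n2,b\n3,c\n"
lines = text.splitlines()

header, body = lines[:1], lines[1:]
split_count = 2
splits = ["\n".join(header + body[i:i + split_count])
          for i in range(0, len(body), split_count)]
print(splits)   # two splits, each starting with the header line
```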

SplitXml

Splits an XML File into multiple separate FlowFiles, each comprising a child or descendant of the original root element

SpringContextProcessor

A Processor that supports sending data to and receiving data from an application defined in a Spring Application Context via predefined in/out MessageChannels.

TagS3Object

Sets tags on a FlowFile within an Amazon S3 Bucket. If attempting to tag a file that does not exist, FlowFile is routed to success.

TailFile

“Tails” a file, or a list of files, ingesting data from the file as it is written to the file. The file is expected to be textual. Data is ingested only when a new line is encountered (carriage return or new-line character or combination). If the file to tail is periodically “rolled over”, as is generally the case with log files, an optional Rolling Filename Pattern can be used to retrieve data from files that have rolled over, even if the rollover occurred while NiFi was not running (provided that the data still exists upon restart of NiFi). It is generally advisable to set the Run Schedule to a few seconds, rather than running with the default value of 0 secs, as this Processor will consume a lot of resources if scheduled very aggressively. At this time, this Processor does not support ingesting files that have been compressed when ‘rolled over’.

TransformXml

Applies the provided XSLT file to the flowfile XML payload. A new FlowFile is created with transformed content and is routed to the ‘success’ relationship. If the XSL transform fails, the original FlowFile is routed to the ‘failure’ relationship

UnpackContent

Unpacks the content of FlowFiles that have been packaged with one of several different Packaging Formats, emitting one to many FlowFiles for each input FlowFile

UpdateAttribute

Updates the Attributes for a FlowFile by using the Attribute Expression Language and/or deletes the attributes based on a regular expression

UpdateCounter

This processor allows users to set specific counters and key points in their flow. It is useful for debugging and basic counting functions.

UpdateHiveTable

This processor uses a Hive JDBC connection and incoming records to generate any Hive 1.2 table changes needed to support the incoming records.

UpdateRecord

Updates the contents of a FlowFile that contains Record-oriented data (i.e., data that can be read via a RecordReader and written by a RecordWriter). This Processor requires that at least one user-defined Property be added. The name of the Property should indicate a RecordPath that determines the field that should be updated. The value of the Property is either a replacement value (optionally making use of the Expression Language) or is itself a RecordPath that extracts a value from the Record. Whether the Property value is determined to be a RecordPath or a literal value depends on the configuration of the Property.

ValidateCsv

Validates the contents of FlowFiles against a user-specified CSV schema. Take a look at the additional documentation of this processor for some schema examples.

ValidateRecord

Validates the Records of an incoming FlowFile against a given schema. All records that adhere to the schema are routed to the “valid” relationship while records that do not adhere to the schema are routed to the “invalid” relationship. It is therefore possible for a single incoming FlowFile to be split into two individual FlowFiles if some records are valid according to the schema and others are not. Any FlowFile that is routed to the “invalid” relationship will emit a ROUTE Provenance Event with the Details field populated to explain why records were invalid. In addition, to gain further explanation of why records were invalid, DEBUG-level logging can be enabled for the “org.apache.nifi.processors.standard.ValidateRecord” logger.

ValidateXml

Validates the contents of FlowFiles against a user-specified XML Schema file

Wait

Routes incoming FlowFiles to the ‘wait’ relationship until a matching release signal is stored in the distributed cache from a corresponding Notify processor. When a matching release signal is identified, a waiting FlowFile is routed to the ‘success’ relationship. The release signal entry is then removed from the cache. The attributes of the FlowFile that produced the release signal are copied to the waiting FlowFile if the Attribute Cache Regex property of the corresponding Notify processor is set properly. If there are multiple release signals in the cache identified by the Release Signal Identifier, and the Notify processor is configured to copy the FlowFile attributes to the cache, then the FlowFile passing the Wait processor receives the union of the attributes of the FlowFiles that produced the release signals in the cache (identified by Release Signal Identifier). Waiting FlowFiles will be routed to ‘expired’ if they exceed the Expiration Duration. If you need to wait for more than one signal, specify the desired number of signals via the ‘Target Signal Count’ property. This is particularly useful with processors that split a source FlowFile into multiple fragments, such as SplitText. In order to wait for all fragments to be processed, connect the ‘original’ relationship to a Wait processor, and the ‘splits’ relationship to a corresponding Notify processor. Configure the Notify and Wait processors to use the ‘${fragment.identifier}’ as the value of ‘Release Signal Identifier’, and specify ‘${fragment.count}’ as the value of ‘Target Signal Count’ in the Wait processor. It is recommended to use a prioritizer (for instance First In First Out) when using the ‘wait’ relationship as a loop.
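
The release logic can be pictured with the small sketch below, where a plain dict stands in for the distributed map cache and ‘frag-123’ / 3 stand in for ${fragment.identifier} and ${fragment.count}:

```python
# Notify increments a counter under the release signal identifier; Wait releases
# its FlowFile once the counter reaches the Target Signal Count.
cache = {}

def notify(signal_id):
    cache[signal_id] = cache.get(signal_id, 0) + 1

def wait(signal_id, target_count):
    return cache.get(signal_id, 0) >= target_count   # True -> route to 'success'

for _ in range(3):            # e.g. three split fragments pass the Notify processor
    notify("frag-123")
print(wait("frag-123", 3))    # the original FlowFile waiting on this signal is released
```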

YandexTranslate

Translates content and attributes from one language to another

If you have questions about a processor, I’d encourage you to download the binaries and start up Apache Nifi. You can also browse the processor documentation on their website, since it is now built with each release. If you really want more information, let me know and I’ll try to compile a more complete post about each and every processor.