This is a copy of a blog post I wrote, originally posted on InfluxData.com
With billions of devices and applications producing time series data every nanosecond, InfluxDB is the leading way to store and analyze this data. Given the enormous variety of data sources, InfluxDB provides multiple ways for users to get data in. One of the most common formats for this data is CSV (comma-separated values).
This blog post demonstrates how to take CSV data, translate it into line protocol, and send it to InfluxDB using the InfluxDB CLI and InfluxDB Client libraries.
What are CSV and Line Protocol?
CSV is plain-text data that uses commas to separate values. Each row is a different record, and each record consists of one or more fields. CSV has historically been used for exporting data and recording events over time, which makes it a great fit for a time series database like InfluxDB.
Here is a basic example of CSV data. It starts with a header that provides the names of each column, followed by each record in a separate row:
name,building,temperature,humidity,time
iot-devices,5a,72.3,34.1,2022-10-01T12:01:00Z
iot-devices,5a,72.1,33.8,2022-10-02T12:01:00Z
iot-devices,5a,72.2,33.7,2022-10-03T12:01:00Z
To get this data into InfluxDB, the tools below show how to translate these rows into line protocol, InfluxDB's text-based write format. Line protocol consists of the following items:
- Measurement name: the name under which the data is stored and later queried
- Fields: the actual data that will be queried
- Tags: optional string values that are used to index and help filter data
- Timestamp: optional, but very common in CSV data, specifying when the record was collected or produced
Using the CSV example above, the goal state is to translate the data into something like the following line protocol:
iot-devices,building=5a temperature=72.3,humidity=34.1 1664625660000000000
iot-devices,building=5a temperature=72.1,humidity=33.8 1664712060000000000
iot-devices,building=5a temperature=72.2,humidity=33.7 1664798460000000000
In the above, iot-devices is the measurement name and building is a tag. The temperature and humidity values are the fields. Finally, the timestamp is saved as a nanosecond-precision UNIX timestamp.
Influx CLI
The Influx CLI tool provides commands to manage and interact with InfluxDB. With this tool, users can set up, configure, and interact with many of the capabilities of InfluxDB. From setting up new buckets and orgs to querying data, to even pushing data, the CLI can do it all.
One of those subcommands is write, which allows users to load data directly into InfluxDB from annotated CSV.
CSV annotations
Annotations, either in the CSV file itself or provided as CLI options, are properties of the columns in the CSV file. They describe how to translate each column into either a measurement name, tag, field, or timestamp.
The following shows the example data, with annotations added, saved to a file:
#datatype measurement,tag,double,double,dateTime:RFC3339
name,building,temperature,humidity,time
iot-devices,5a,72.3,34.1,2022-10-01T12:01:00Z
iot-devices,5a,72.1,33.8,2022-10-02T12:01:00Z
iot-devices,5a,72.2,33.7,2022-10-03T12:01:00Z
The data types in this example are specified as follows:
- measurement: states which column to use as the measurement name. If no such column exists, this can also be supplied as a header via the CLI.
- tag: specifies which column or columns to treat as string tag data. Tags are optional, but help with querying and indexing data in InfluxDB.
- double: used on two columns to specify that they contain double (floating-point) values.
- dateTime: specifies that the final column contains the timestamp of the record, and the :RFC3339 suffix states the format used.
Users can also specify additional data types for fields:
- double
- long
- unsignedLong
- boolean
- string
- ignored: used if a column is not useful or required; the column will not be included in the final data
Finally, for timestamps, there are built-in parsing capabilities for:
- RFC3339 (e.g. 2020-01-01T00:00:00Z)
- RFC3339Nano (e.g. 2020-01-01T00:00:00.000000000Z)
- Unix timestamps (e.g. 1577836800000000000)
If the timestamp is not in one of these formats, then users need to specify the format of the timestamp themselves (e.g. dateTime:2006-01-02) as part of the annotation, using Go reference time.
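For example, a file whose timestamps contain only a date could be annotated with a Go reference-time layout like this (a sketch reusing the example data's columns; the date-only values are illustrative):

```
#datatype measurement,tag,double,double,dateTime:2006-01-02
name,building,temperature,humidity,date
iot-devices,5a,72.3,34.1,2022-10-01
```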
CLI examples
Once annotations exist, it is time to send the data to InfluxDB using the CLI. Below is an example of sending the data contained in a file:
influx write --bucket example-bucket --file path/to/data.csv
If the CSV itself does not have the annotations, then a user can add them as part of the CLI command:
influx write --bucket example-bucket \
  --header "#datatype measurement,tag,double,double,dateTime:RFC3339" \
  --file path/to/data.csv
Finally, if a CSV file does not have a relevant column for the measurement name, that too can be included as a header:
influx write --bucket example-bucket \
  --header "#constant measurement,iot-devices" \
  --header "#datatype tag,double,double,dateTime:RFC3339" \
  --file path/to/data.csv
To get started with the Influx CLI tool, visit the docs site where users can find steps to install and get started with it. Check out the Write CSV data to InfluxDB docs for more details and examples, including skipping header rows, different encodings, and error handling. Additionally, see this previous blog post to learn more about annotated CSV and how you can write the data directly with Flux queries.
The Influx CLI provides a simple and fast way to get started, but what if the user’s files are much larger, not annotated, or need to have some preprocessing done to them before pushing to InfluxDB? In these scenarios, users should look to the InfluxDB Client Libraries.
InfluxDB Client Libraries
The InfluxDB Client Libraries provide language-specific packages for interacting with the InfluxDB v2 API. They allow users to create, process, and package data in a programming language of their choice and then send it to InfluxDB. The libraries are available in many languages, including Python, JavaScript, Go, C#, Java, and many others.
The following provides two examples of parsing CSV data with Python and Java and then sending that data to InfluxDB.
Python + Pandas
The Python programming language has enabled many to learn and start programming easily. Pandas, a Python data analysis library, is a fast and powerful tool for data analysis and manipulation. Together, the two are a powerful combination for processing data and sending it to InfluxDB with the InfluxDB client library.
If a user has a very large CSV file or files they want to push to InfluxDB, Pandas provides an easy way to read a CSV file with headers quickly. Combined with the built-in functionality of the InfluxDB client libraries to write Pandas DataFrames, a user can read a CSV in chunks and then send those chunks into InfluxDB.
In the following example, a user reads a CSV file containing thousands of rows of VIX stock data.
To avoid reading the entire file into memory, the user can take advantage of Pandas' read_csv function, which reads the column names from the CSV header and splits the file into 1,000-row chunks. Finally, the InfluxDB client library sends each group of 1,000 rows to InfluxDB after specifying the measurement, tag, and timestamp columns.
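A minimal sketch of this approach, assuming the influxdb-client Python package is installed; the connection details, file name, column names ("Date"), bucket ("stocks"), and measurement name ("vix") are all placeholders, not values from the original example:

```python
import pandas as pd
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details -- replace with your own.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Read the CSV in 1,000-row chunks so the whole file never sits in memory.
# The "Date" column name is an assumption about the file's header.
for chunk in pd.read_csv("vix-daily.csv", chunksize=1000):
    # The DataFrame index becomes the timestamp of each point.
    chunk["Date"] = pd.to_datetime(chunk["Date"])
    chunk = chunk.set_index("Date")
    write_api.write(
        bucket="stocks",                    # destination bucket
        record=chunk,
        data_frame_measurement_name="vix",  # measurement name for every row
        data_frame_tag_columns=[],          # no tag columns in this dataset
    )

client.close()
```

Because write_api accepts a DataFrame directly, no manual line-protocol formatting is needed; the remaining numeric columns automatically become fields.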
Java
The Java programming language is used everywhere from Android devices to enterprise applications. Java users can look to opencsv to get started quickly with CSV data parsing.
This example uses a plain old Java object (aka POJO) with annotations that tell opencsv which CSV columns map to which object variables, and tell the InfluxDB client library the measurement name for the class as well as which variables should become tags, fields, or timestamps.
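A sketch of such a POJO, assuming the opencsv and influxdb-client-java libraries; the CSV column names (symbol, open, close, time) and the measurement name are hypothetical, not taken from the original example:

```java
import com.influxdb.annotations.Column;
import com.influxdb.annotations.Measurement;
import com.opencsv.bean.CsvBindByName;
import java.time.Instant;

// Maps one CSV row to one InfluxDB point.
@Measurement(name = "stock-data")
public class StockData {
    @CsvBindByName(column = "symbol")
    @Column(tag = true)       // indexed string tag
    public String symbol;

    @CsvBindByName(column = "open")
    @Column                   // numeric field
    public Double open;

    @CsvBindByName(column = "close")
    @Column                   // numeric field
    public Double close;

    @CsvBindByName(column = "time")
    public String time;       // raw timestamp string read from the CSV

    @Column(timestamp = true)
    public Instant timestamp; // parsed from "time" before writing
}
```

Keeping the raw timestamp as a String and parsing it separately sidesteps any date-conversion configuration in opencsv.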
Then a user can iterate through a CSV file and create a StockData object for each line. Each object can then be manipulated, if required, before sending it to InfluxDB.
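A sketch of that loop, assuming the StockData POJO described above; the file name and connection details are placeholders:

```java
import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.client.WriteApiBlocking;
import com.influxdb.client.domain.WritePrecision;
import com.opencsv.bean.CsvToBeanBuilder;
import java.io.FileReader;
import java.time.Instant;
import java.util.List;

public class CsvImporter {
    public static void main(String[] args) throws Exception {
        // Parse every CSV row into a StockData bean (file name is a placeholder).
        List<StockData> records =
                new CsvToBeanBuilder<StockData>(new FileReader("stocks.csv"))
                        .withType(StockData.class)
                        .build()
                        .parse();

        // Placeholder connection details -- replace with your own.
        try (InfluxDBClient client = InfluxDBClientFactory.create(
                "http://localhost:8086", "my-token".toCharArray(), "my-org", "my-bucket")) {
            WriteApiBlocking writeApi = client.getWriteApiBlocking();
            for (StockData record : records) {
                // Any per-record manipulation happens here, e.g. parsing the timestamp.
                record.timestamp = Instant.parse(record.time);
                writeApi.writeMeasurement(WritePrecision.NS, record);
            }
        }
    }
}
```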
Check out the InfluxDB CLI & Client Libraries today
This post has shown how quick, easy, and flexible the Influx CLI and the InfluxDB client libraries are to use. While the examples above only cover CSV data, they hint at the power users have when sending data to InfluxDB. Combined with the other APIs, users have even more options and potential.
Consider where you might be able to use InfluxDB, Influx CLI, and the client libraries, and give them a shot today!