MySQL: Implementation of User Defined (Loadable) Function and using BLOB to store BigInteger

In this writing I will give a hands-on example and explain the followings:
- What is User Defined Function(UDF) in MySQL database and why should we use them?
- How to implement UDF in MySQL?
- Pros and cons.
Pre-requirements to fully understand this writing:
- A basic understanding of C/C++ or a similar language since MySQL only allows using either of these languages.
- Basic SQL and MySQL knowledge
- Basics of Docker
- C++ is used to write the UDF.
- Python’s PyMySQL package is used to test.
- I used Docker to host the MySQL server.
What and Why UDF?
Those who already know the UDF skip to the Implementation
In the MySQL database, there are pre-implemented functions such as MIN(), MAX(), AVG(), SUM() and a lot. We usually use them as follows:
SELECT MAX(column_name) FROM table_name;
In addition to the existing functions provided by the MySQL server, we can write our own functions e.g. MY_DEFINED_FUNC, and use them like this:
SELECT MY_DEFINED_FUNC(column_name) FROM table_name;
MySQL provides an API for this which is quite easy to follow but for those who have not implemented it yet, this writing may help them to write their first UDF.
The complete list can be found here.
Why should we need UDF?
The first question before doing something should always start with “WHY”.
The answer may not be convincing for you now but for some people like me, it was necessary to improve some performance gain.
Case 1: Imagine that we have millions of records for a column. When may need to do some calculation on those data, usually, we will use a SELECT query to fetch all the data from the database and then do the rest of the operation.
But the above approach can be sometimes inefficient due to the large amount of data sent over the network or database protocol.
So what if we can write a function that can do all the operations inside the database server instead of our web application or database client and return the final result; just like the aggregate function SUM; wouldn’t be great to get some performance gain?
Case 2: Imagine we need to store a very large integer. MySQL BIGINT supports a maximum of 8 bytes. For most practical purposes that may be enough but there can be some use cases where we may need to store 256Bytes of INT. How to do that?
- One possible way is to convert such BIGINT to STRING and insert it VARCHAR column.
- The other way is to covert the BIGINT to BYTES and insert in in BLOB.
In my example:
- I used BLOB to store BIGINT converted into BYTE.
- Inside the UDF I used GMP to covert the BYTE to mpz_t. mpz_t is a type in GMP to handle multiple-precision integers.
- Since MySQL stores the data as binary in the BLOB type column, it's confusing how the integer will be returned( Big-endian or little-endian). So I converted the data using the HEX() function of MySQL.
Let’s see the full program.
The fundamentals of the UDF is explained here https://dev.mysql.com/doc/extending-mysql/8.0/en/adding-loadable-function.html
I am going to summarize the necessary part:
- The function I wanted to create was defined as big_average.
- So the SQL will be
SELECT big_average(column_name_of_blob_type) FROM table_name
- We need to define and implement the following functions.
bool big_average_init(UDF_INIT *initid, UDF_ARGS *args, char *message);void big_average_deinit(UDF_INIT *initid);void big_average_reset(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *message);void big_average_clear(UDF_INIT *initid, char *is_null, char *message);void big_average_add(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *message);char *big_average(UDF_INIT *initid, UDF_ARGS *args, char *result, unsigned long *length, char *is_null, char *error);
- Here `big_averag_` is equivalent to XXX in the API documentation. Except big_average_reset other functions are required to be implemented. _reset function for backward compatibility.
Now let’s explain how to test and run the project:
Clone it from:
Everything inside the Dockerfile is will not be difficult for people with a Basic Docker understanding.
This part
RUN g++ avg_big_num.cpp -fPIC -lgmp -shared -o /usr/lib/mysql/plugin/big_average.so -I /usr/include/mysql
Compiles the UDF source C++ file as a shared object file and as big_average.so and put in MySQL's plugin directory. (/usr/lib/mysql/plugin)
- -lgmp flag is for the GMP library. Please check the Dockerfile for this line where all the necessary development packages are installed.
RUN apt install -y build-essential wget libmysqlclient-dev libgmp3-dev python3-pip libssl-dev libffi-dev python3-dev
The most important part is the file name has to be the same as the function name.
The second most important part is after putting the .so file in the plugin directory we need to run this SQL to create the UDF function.
CREATE AGGREGATE FUNCTION big_average RETURNS STRING SONAME big_average.so
We can check if the function is created by running the following SQL.
mysql> select * from mysql.func;
+ — — — — — — -+ — — -+ — — — — — — — — + — — — — — -+
| name | ret | dl | type |
+ — — — — — — -+ — — -+ — — — — — — — — + — — — — — -+
| big_average | 0 | big_average.so | aggregate |
+ — — — — — — -+ — — -+ — — — — — — — — + — — — — — -+
1 row in set (0.00 sec)
How to test:
- First use
docker-compose up
to run the MySQL serve. It will create a container named ‘mysql_udf’.
Please configure the host port according to the available port in your environment.
Then run the test_big_avg.py to check the correctness of the UDF implementation. The python implementation is simple.
Please check the functions and let me know in the comments if you have any confusion to understand.
In the test, I inserted 5 values of 256 to 512 bit. You can play with much larger data if you want.
vals = [random.randint(2**256, 2**512) for _ in range(input_range)]
If you use MySQL CLI and run the following SQL
select big_average(hex(no_of_atoms)) from galaxy;
You will probably see some HEX value. In Python, I coveted and compare with the Python-generated average.
mysql> select big_average(hex(no_of_atoms)) from galaxy;
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — +
| big_average(hex(no_of_atoms)) |
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — +
| 0x3534633666626635633138363764343264383562373761376664633865623739653763303037353866383139343131636265323230653539333266333261373436633134356436316630636133333638353732343134383834353766366466376337646237636330633139393235376535636139623061386563393033386639 |
+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — +
1 row in set (0.00 sec)
Play with the project and let me know if you find any bugs and find something difficult to understand.
While testing, once the init_function() is called no need to run it subsequent times unless you call the drop_function().
PS: I wrote this article quite fast and didn’t have enough time to review it. Will modify it upon the reader's comment.
Let me know in the comment if this article helps you.
Happy learning :-)