Asterisk - The Open Source Telephony Project  21.4.1
utf8.h File Reference

UTF-8 information and validation functions.

Enumerations

enum  ast_utf8_replace_result { AST_UTF8_REPLACE_VALID, AST_UTF8_REPLACE_INVALID, AST_UTF8_REPLACE_OVERRUN }
 
enum  ast_utf8_validation_result { AST_UTF8_VALID, AST_UTF8_INVALID, AST_UTF8_UNKNOWN }
 

Functions

void ast_utf8_copy_string (char *dst, const char *src, size_t size)
 Copy a string safely ensuring valid UTF-8.
 
int ast_utf8_init (void)
 Register UTF-8 tests.
 
int ast_utf8_is_valid (const char *str)
 Check if a zero-terminated string is valid UTF-8.
 
int ast_utf8_is_validn (const char *str, size_t size)
 Check if the first size bytes of a string are valid UTF-8.
 
enum ast_utf8_replace_result ast_utf8_replace_invalid_chars (char *dst, size_t *dst_size, const char *src, size_t src_len)
 Copy a string safely replacing any invalid UTF-8 sequences.
 
void ast_utf8_validator_destroy (struct ast_utf8_validator *validator)
 Destroy a UTF-8 validator.
 
enum ast_utf8_validation_result ast_utf8_validator_feed (struct ast_utf8_validator *validator, const char *data)
 Feed a zero-terminated string into the UTF-8 validator.
 
enum ast_utf8_validation_result ast_utf8_validator_feedn (struct ast_utf8_validator *validator, const char *data, size_t size)
 Feed a string into the UTF-8 validator.
 
int ast_utf8_validator_new (struct ast_utf8_validator **validator)
 Create a new UTF-8 validator.
 
void ast_utf8_validator_reset (struct ast_utf8_validator *validator)
 Reset the state of a UTF-8 validator.
 
enum ast_utf8_validation_result ast_utf8_validator_state (struct ast_utf8_validator *validator)
 Get the current UTF-8 validator state.
 

Detailed Description

UTF-8 information and validation functions.

Definition in file utf8.h.

Enumeration Type Documentation

enum ast_utf8_replace_result

Enumerator
AST_UTF8_REPLACE_VALID 

Source contained fully valid UTF-8.

The entire string was valid UTF-8 and no replacement was required.

AST_UTF8_REPLACE_INVALID 

Source contained at least 1 invalid UTF-8 sequence.

Parts of the string contained invalid UTF-8 sequences but those were successfully replaced with the U+FFFD replacement sequence.

AST_UTF8_REPLACE_OVERRUN 

Not enough space to copy entire source.

The destination buffer wasn't large enough to hold all of the source characters. As many source characters as would fit were copied or replaced, and a final NUL terminator was added.

Definition at line 70 of file utf8.h.

enum ast_utf8_replace_result {
    /*! \brief Source contained fully valid UTF-8
     *
     * The entire string was valid UTF-8 and no replacement
     * was required.
     */
    AST_UTF8_REPLACE_VALID,

    /*! \brief Source contained at least 1 invalid UTF-8 sequence
     *
     * Parts of the string contained invalid UTF-8 sequences
     * but those were successfully replaced with the U+FFFD
     * replacement sequence.
     */
    AST_UTF8_REPLACE_INVALID,

    /*! \brief Not enough space to copy entire source
     *
     * The destination buffer wasn't large enough to copy
     * all of the source characters. As many of the source
     * characters that could be copied/replaced were done so
     * and a final NULL terminator added.
     */
    AST_UTF8_REPLACE_OVERRUN,
};

enum ast_utf8_validation_result

Enumerator
AST_UTF8_VALID 

The consumed sequence is valid UTF-8.

The bytes consumed thus far by the validator represent a valid sequence of UTF-8 bytes. If additional bytes are fed into the validator, it can transition into either AST_UTF8_INVALID or AST_UTF8_UNKNOWN.

AST_UTF8_INVALID 

The consumed sequence is invalid UTF-8.

The bytes consumed thus far by the validator represent an invalid sequence of UTF-8 bytes. Feeding additional bytes into the validator will not change its state.

AST_UTF8_UNKNOWN 

The validator is in an intermediate state.

The validator is in the process of validating a multibyte UTF-8 sequence and requires additional data to be fed into it to determine validity. If additional bytes are fed into the validator, it can transition into either AST_UTF8_VALID or AST_UTF8_INVALID. If you have no additional data to feed into the validator, the UTF-8 sequence is invalid.

Definition at line 123 of file utf8.h.

enum ast_utf8_validation_result {
    /*! \brief The consumed sequence is valid UTF-8
     *
     * The bytes consumed thus far by the validator represent a valid sequence of
     * UTF-8 bytes. If additional bytes are fed into the validator, it can
     * transition into either \a AST_UTF8_INVALID or \a AST_UTF8_UNKNOWN
     */
    AST_UTF8_VALID,

    /*! \brief The consumed sequence is invalid UTF-8
     *
     * The bytes consumed thus far by the validator represent an invalid sequence
     * of UTF-8 bytes. Feeding additional bytes into the validator will not
     * change its state.
     */
    AST_UTF8_INVALID,

    /*! \brief The validator is in an intermediate state
     *
     * The validator is in the process of validating a multibyte UTF-8 sequence
     * and requires additional data to be fed into it to determine validity. If
     * additional bytes are fed into the validator, it can transition into either
     * \a AST_UTF8_VALID or \a AST_UTF8_INVALID. If you have no additional data
     * to feed into the validator the UTF-8 sequence is invalid.
     */
    AST_UTF8_UNKNOWN,
};
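
The three validation states can be illustrated with a small standalone sketch. It does not use the Asterisk implementation: the names sketch_validator, sketch_reset, sketch_state and sketch_feedn are hypothetical, and the classifier is deliberately simplified (it checks lead and continuation bytes only, so it does not reject overlong three- or four-byte forms or surrogates). It is meant only to show how a multibyte sequence split across two feeds moves from the intermediate state to the valid state, and how the invalid state is sticky until a reset.

```c
#include <stddef.h>

/* Hypothetical names; these mirror the three documented states. */
enum sketch_result { SKETCH_VALID, SKETCH_INVALID, SKETCH_UNKNOWN };

struct sketch_validator {
    int pending;  /* continuation bytes still expected */
    int invalid;  /* sticky: once set, only a reset clears it */
};

static void sketch_reset(struct sketch_validator *v)
{
    v->pending = 0;
    v->invalid = 0;
}

static enum sketch_result sketch_state(const struct sketch_validator *v)
{
    if (v->invalid) {
        return SKETCH_INVALID;
    }
    return v->pending ? SKETCH_UNKNOWN : SKETCH_VALID;
}

/* Feed up to size bytes, stopping early at a zero byte. */
static enum sketch_result sketch_feedn(struct sketch_validator *v,
    const char *data, size_t size)
{
    size_t i;

    for (i = 0; i < size && data[i] != '\0'; i++) {
        unsigned char b = (unsigned char) data[i];

        if (v->invalid) {
            break;  /* further input cannot change the state */
        }
        if (v->pending) {
            if ((b & 0xC0) == 0x80) {
                v->pending--;
            } else {
                v->invalid = 1;
            }
        } else if (b < 0x80) {
            /* single-byte character, nothing pending */
        } else if (b >= 0xC2 && b <= 0xDF) {
            v->pending = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            v->pending = 2;
        } else if (b >= 0xF0 && b <= 0xF4) {
            v->pending = 3;
        } else {
            v->invalid = 1;
        }
    }
    return sketch_state(v);
}
```

Feeding the first two bytes of the three-byte sequence E2 82 AC leaves the sketch in the intermediate state, feeding the final byte makes it valid, and feeding a stray FF byte afterwards makes it invalid until sketch_reset() is called.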

Function Documentation

void ast_utf8_copy_string (char *dst, const char *src, size_t size)

Copy a string safely ensuring valid UTF-8.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

This is similar to ast_copy_string, but it will only copy valid UTF-8 sequences from the source string into the destination buffer. If an invalid UTF-8 sequence is encountered, or the available space in the destination buffer is exhausted in the middle of an otherwise valid UTF-8 sequence, the destination buffer will be truncated to ensure that it only contains valid UTF-8.

Parameters
dst   The destination buffer.
src   The source string.
size  The size of the destination buffer.

Definition at line 133 of file utf8.c.

void ast_utf8_copy_string(char *dst, const char *src, size_t size)
{
    uint32_t state = UTF8_ACCEPT;
    char *last_good = dst;

    ast_assert(size > 0);

    while (size && *src) {
        if (decode(&state, (uint8_t) *src) == UTF8_REJECT) {
            /* We _could_ replace with U+FFFD and try to recover, but for now
             * we treat this the same as if we had run out of space */
            break;
        }

        *dst++ = *src++;
        size--;

        if (size && state == UTF8_ACCEPT) {
            /* last_good is where we will ultimately write the 0 byte */
            last_good = dst;
        }
    }

    *last_good = '\0';
}
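
The truncation behaviour described above (never leaving half of a multibyte character in the destination) can be tried outside Asterisk with the following sketch. It keeps the last_good idea from the listing but swaps the decode() state machine for a simplified pending-byte counter, so it is an approximation only: it does not reject overlong or surrogate encodings. The name sketch_utf8_copy is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the documented copy behaviour.
 * Copies src into dst (size bytes available) and terminates dst at
 * the end of the last complete character that still leaves room for
 * the terminating zero byte.
 */
static void sketch_utf8_copy(char *dst, const char *src, size_t size)
{
    char *last_good = dst;  /* where the terminator will be written */
    int pending = 0;        /* continuation bytes still expected */

    if (size == 0) {
        return;
    }

    while (size && *src) {
        unsigned char b = (unsigned char) *src;

        if (pending) {
            if ((b & 0xC0) != 0x80) {
                break;      /* broken multibyte sequence */
            }
            pending--;
        } else if (b < 0x80) {
            pending = 0;
        } else if (b >= 0xC2 && b <= 0xDF) {
            pending = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            pending = 2;
        } else if (b >= 0xF0 && b <= 0xF4) {
            pending = 3;
        } else {
            break;          /* byte can never start a sequence */
        }

        *dst++ = *src++;
        size--;

        /* Only advance the terminator position on a character boundary
         * that still leaves one byte free for the terminator. */
        if (size && pending == 0) {
            last_good = dst;
        }
    }

    *last_good = '\0';
}
```

With the source "h" followed by the two-byte sequence C3 A9 and then "llo", a 3-byte destination ends up holding only "h", because the two-byte character plus the terminator would not fit, while a 4-byte destination holds "h" plus the complete two-byte character.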

int ast_utf8_init (void)

Register UTF-8 tests.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Does nothing unless TEST_FRAMEWORK is defined.

Return values
0  Always

Definition at line 919 of file utf8.c.

int ast_utf8_init(void)
{
    return 0;
}

int ast_utf8_is_valid (const char *str)

Check if a zero-terminated string is valid UTF-8.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Parameters
str  The zero-terminated string to check

Return values
0         if the string is not valid UTF-8
Non-zero  if the string is valid UTF-8

Definition at line 110 of file utf8.c.

int ast_utf8_is_valid(const char *src)
{
    uint32_t state = UTF8_ACCEPT;

    while (*src) {
        decode(&state, (uint8_t) *src++);
    }

    return state == UTF8_ACCEPT;
}

int ast_utf8_is_validn (const char *str, size_t size)

Check if the first size bytes of a string are valid UTF-8.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Similar to ast_utf8_is_valid() but checks the first size bytes or until a zero byte is reached, whichever comes first.

Parameters
str   The string to check
size  The number of bytes to evaluate

Return values
0         if the string is not valid UTF-8
Non-zero  if the string is valid UTF-8

Definition at line 121 of file utf8.c.

int ast_utf8_is_validn(const char *src, size_t size)
{
    uint32_t state = UTF8_ACCEPT;

    while (size && *src) {
        decode(&state, (uint8_t) *src++);
        size--;
    }

    return state == UTF8_ACCEPT;
}

enum ast_utf8_replace_result ast_utf8_replace_invalid_chars (char *dst, size_t *dst_size, const char *src, size_t src_len)

Copy a string safely replacing any invalid UTF-8 sequences.

This is similar to ast_copy_string, but it will only copy valid UTF-8 sequences from the source string into the destination buffer. If an invalid sequence is encountered, it is replaced with U+FFFD, the valid UTF-8 sequence that represents an unknown, unrecognized, or unrepresentable character. Since U+FFFD is actually a 3-byte sequence, the destination buffer will need to be larger than the corresponding source string if the source contains invalid sequences. You can pass NULL as the destination buffer pointer to get the actual size required, then call the function again with a properly sized buffer.

Parameters
dst       Pointer to the destination buffer. If NULL, dst_size will be set to the size of the buffer required to fully process the source string.
dst_size  A pointer to the size of the dst buffer
src       The source string
src_len   The number of bytes to copy

Returns
ast_utf8_replace_result

Definition at line 173 of file utf8.c.

References AST_UTF8_INVALID, AST_UTF8_REPLACE_INVALID, AST_UTF8_REPLACE_OVERRUN, AST_UTF8_REPLACE_VALID, and REPL_SEQ.

Referenced by ast_channel_publish_varset().

enum ast_utf8_replace_result ast_utf8_replace_invalid_chars(char *dst, size_t *dst_size,
    const char *src, size_t src_len)
{
    enum ast_utf8_replace_result res = AST_UTF8_REPLACE_VALID;
    size_t src_pos = 0;
    size_t dst_pos = 0;
    uint32_t prev_state = UTF8_ACCEPT;
    uint32_t curr_state = UTF8_ACCEPT;
    /*
     * UTF-8 sequences can be 1 - 4 bytes in length so we
     * have to keep track of where we are.
     */
    int seq_len = 0;

    if (dst) {
        memset(dst, 0, *dst_size);
    } else {
        *dst_size = 0;
    }

    if (!src || src_len == 0) {
        return AST_UTF8_REPLACE_VALID;
    }

    for (prev_state = 0, curr_state = 0; src_pos < src_len; prev_state = curr_state, src_pos++) {
        uint32_t rc;

        rc = decode(&curr_state, (uint8_t) src[src_pos]);

        if (dst && dst_pos >= *dst_size - 1) {
            if (prev_state > UTF8_REJECT) {
                /*
                 * We ran out of space in the middle of a possible
                 * multi-byte sequence so we have to back up and
                 * overwrite the start of the sequence with the
                 * NULL terminator.
                 */
                dst_pos -= (seq_len - (prev_state / 36));
            }
            dst[dst_pos] = '\0';

            return AST_UTF8_REPLACE_OVERRUN;
        }

        if (rc == UTF8_ACCEPT) {
            if (dst) {
                dst[dst_pos] = src[src_pos];
            }
            dst_pos++;
            seq_len = 0;
        }

        if (rc > UTF8_REJECT) {
            /*
             * We're possibly at the start of, or in the middle of,
             * a multi-byte sequence. The curr_state will tell us how many
             * bytes _should_ be remaining in the sequence.
             */
            if (prev_state == UTF8_ACCEPT) {
                /* If the previous state was a good character then
                 * this can only be the start of a sequence
                 * which is all we care about.
                 */
                seq_len = curr_state / 36 + 1;
            }

            if (dst) {
                dst[dst_pos] = src[src_pos];
            }
            dst_pos++;
        }

        if (rc == UTF8_REJECT) {
            /* We got at least 1 rejection so the string is invalid */
            res = AST_UTF8_REPLACE_INVALID;

            if (prev_state != UTF8_ACCEPT) {
                /*
                 * If we were in a multi-byte sequence and this
                 * byte isn't valid at this time, we'll back
                 * the destination pointer back to the start
                 * of the now-invalid sequence and write the
                 * replacement bytes there. Then we'll
                 * process the current byte again in the next
                 * loop iteration. It may be quite valid later.
                 */
                dst_pos -= (seq_len - (prev_state / 36));
                src_pos--;
            }
            if (dst) {
                /*
                 * If we're not just calculating the needed destination
                 * buffer space, and we don't have enough room to write
                 * the replacement sequence, terminate the output
                 * and return.
                 */
                if (dst_pos > *dst_size - 4) {
                    dst[dst_pos] = '\0';
                    return AST_UTF8_REPLACE_OVERRUN;
                }
                memcpy(&dst[dst_pos], REPL_SEQ, REPL_SEQ_LEN);
            }
            dst_pos += REPL_SEQ_LEN;
            /* Reset the state machine */
            curr_state = UTF8_ACCEPT;
        }
    }

    if (curr_state != UTF8_ACCEPT) {
        /*
         * We were probably in the middle of a
         * sequence and ran out of space.
         */
        /* As listed: AST_UTF8_INVALID is the second enumerator of its enum,
         * like AST_UTF8_REPLACE_INVALID, so the two share a value. */
        res = AST_UTF8_INVALID;
        dst_pos -= (seq_len - (prev_state / 36));
        if (dst) {
            if (dst_pos > *dst_size - 4) {
                dst[dst_pos] = '\0';
                return AST_UTF8_REPLACE_OVERRUN;
            }
            memcpy(&dst[dst_pos], REPL_SEQ, REPL_SEQ_LEN);
        }
        dst_pos += REPL_SEQ_LEN;
    }

    if (dst) {
        dst[dst_pos] = '\0';
    } else {
        *dst_size = dst_pos + 1;
    }

    return res;
}
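
The size-query-then-copy calling pattern described for this function can be explored with the standalone sketch below. It is not the Asterisk implementation: sketch_replace, sketch_seq_len and the SKETCH_* names are hypothetical, validation is done per character rather than with the decode() state machine, and each offending byte is replaced individually, which may yield more replacement characters than the listing above for a truncated multibyte sequence. Only the contract is the same: a NULL destination reports the required size through dst_size, and a non-NULL destination is filled and zero-terminated.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_REPL "\xEF\xBF\xBD"  /* U+FFFD encoded as UTF-8 */
#define SKETCH_REPL_LEN 3

enum sketch_replace_result {
    SKETCH_REPLACE_VALID, SKETCH_REPLACE_INVALID, SKETCH_REPLACE_OVERRUN
};

/* Length (1-4) of the well-formed sequence at s, or 0 if there is none. */
static size_t sketch_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) len = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) len = 4;
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
    }
    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
    if (s[0] == 0xED && s[1] > 0x9F) return 0;  /* surrogate range */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
    if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* above U+10FFFF */
    return len;
}

static enum sketch_replace_result sketch_replace(char *dst, size_t *dst_size,
    const char *src, size_t src_len)
{
    enum sketch_replace_result res = SKETCH_REPLACE_VALID;
    size_t in = 0, out = 0;

    if (dst && *dst_size == 0) {
        return SKETCH_REPLACE_OVERRUN;  /* no room even for the terminator */
    }

    while (in < src_len) {
        size_t len = sketch_seq_len((const unsigned char *) src + in, src_len - in);
        const char *chunk = len ? src + in : SKETCH_REPL;
        size_t chunk_len = len ? len : SKETCH_REPL_LEN;

        if (dst) {
            /* Keep one byte free for the terminator. */
            if (out + chunk_len + 1 > *dst_size) {
                dst[out] = '\0';
                return SKETCH_REPLACE_OVERRUN;
            }
            memcpy(dst + out, chunk, chunk_len);
        }
        out += chunk_len;
        if (!len) {
            res = SKETCH_REPLACE_INVALID;
        }
        in += len ? len : 1;
    }

    if (dst) {
        dst[out] = '\0';
    } else {
        *dst_size = out + 1;  /* required size including the terminator */
    }
    return res;
}
```

A caller would first pass NULL as dst to learn the required size, allocate at least that many bytes, and then call again with the real buffer. For the 3-byte source 'a', FF, 'z', the sketch reports a required size of 6: one byte for 'a', three for the replacement sequence, one for 'z', and one for the terminator.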

void ast_utf8_validator_destroy (struct ast_utf8_validator *validator)

Destroy a UTF-8 validator.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Parameters
validator  The validator instance to destroy

Definition at line 363 of file utf8.c.

void ast_utf8_validator_destroy(struct ast_utf8_validator *validator)
{
    ast_free(validator);
}

enum ast_utf8_validation_result ast_utf8_validator_feed (struct ast_utf8_validator *validator, const char *data)

Feed a zero-terminated string into the UTF-8 validator.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Parameters
validator  The validator instance
data       The zero-terminated string to feed into the validator

Returns
The ast_utf8_validation_result indicating the current state of the validator.

Definition at line 337 of file utf8.c.

References ast_utf8_validator_state().

enum ast_utf8_validation_result ast_utf8_validator_feed(
    struct ast_utf8_validator *validator, const char *data)
{
    while (*data) {
        decode(&validator->state, (uint8_t) *data++);
    }

    return ast_utf8_validator_state(validator);
}

enum ast_utf8_validation_result ast_utf8_validator_feedn (struct ast_utf8_validator *validator, const char *data, size_t size)

Feed a string into the UTF-8 validator.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Similar to ast_utf8_validator_feed(), but stops feeding data when a zero byte is encountered or size bytes have been read, whichever comes first.

Parameters
validator  The validator instance
data       The string to feed into the validator
size       The number of bytes to feed into the validator

Returns
The ast_utf8_validation_result indicating the current state of the validator.

Definition at line 347 of file utf8.c.

References ast_utf8_validator_state().

enum ast_utf8_validation_result ast_utf8_validator_feedn(
    struct ast_utf8_validator *validator, const char *data, size_t size)
{
    while (size && *data) {
        decode(&validator->state, (uint8_t) *data++);
        size--;
    }

    return ast_utf8_validator_state(validator);
}

int ast_utf8_validator_new (struct ast_utf8_validator **validator)

Create a new UTF-8 validator.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0
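
Parameters
[out] validator  The validator instance

Return values
0   on success
-1  on failure (the implementation listed for this function returns 1 when allocation fails, so callers should treat any non-zero value as failure)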

Definition at line 311 of file utf8.c.

References ast_malloc.

int ast_utf8_validator_new(struct ast_utf8_validator **validator)
{
    struct ast_utf8_validator *tmp = ast_malloc(sizeof(*tmp));

    if (!tmp) {
        return 1;
    }

    tmp->state = UTF8_ACCEPT;
    *validator = tmp;
    return 0;
}

void ast_utf8_validator_reset (struct ast_utf8_validator *validator)

Reset the state of a UTF-8 validator.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Resets the provided UTF-8 validator to its initial state so that it can be reused.

Parameters
validator  The validator instance to reset

Definition at line 358 of file utf8.c.

void ast_utf8_validator_reset(struct ast_utf8_validator *validator)
{
    validator->state = UTF8_ACCEPT;
}

enum ast_utf8_validation_result ast_utf8_validator_state (struct ast_utf8_validator *validator)

Get the current UTF-8 validator state.

Since
13.36.0, 16.13.0, 17.7.0, 18.0.0

Parameters
validator  The validator instance

Returns
The ast_utf8_validation_result indicating the current state of the validator.

Definition at line 324 of file utf8.c.

References AST_UTF8_INVALID, AST_UTF8_UNKNOWN, and AST_UTF8_VALID.

Referenced by ast_utf8_validator_feed(), and ast_utf8_validator_feedn().

enum ast_utf8_validation_result ast_utf8_validator_state(
    struct ast_utf8_validator *validator)
{
    switch (validator->state) {
    case UTF8_ACCEPT:
        return AST_UTF8_VALID;
    case UTF8_REJECT:
        return AST_UTF8_INVALID;
    default:
        return AST_UTF8_UNKNOWN;
    }
}