Filtering records by ISO8601 format

Hi guys, I need some collaborative wisdom here. Thanks for helping in advance.

Given a MySQL column of ISO8601 periods stored as strings, is it possible to filter them by start date/time or end date/time from within MySQL?

As per ISO8601 specifications, it could omit later parts if it is the same as previous, meaning “2007-11-13T09:00/15T17:00” is actually happening for 4 days (from 11th to 15th).

The difficult case would be periods being able to do something like “2007-03-01T13:00:00Z/P1Y2M10DT2H30M” which doesn’t have the evaluated end time stored and hence requires calculation to see if it’s hitting the starting date criteria. The same goes for the period + end date case where the start date has to be evaluated before determining if it hits the end date criteria. Unfortunately MySQL does not come with something that deals with ISO8601 easily or natively.

In ideal cases this should be stored as evaluated values, but currently this is what I have to deal with before things get ideal due to 3rd party limitations. Running a round of calculation is not feasible in short term due to the amount of data accumulated from a decades-aged antique.

P.S. The table is read-only, no update operations are expected to be performed.

Any ideas?

I don’t know much beyond basic SQL statements, and time formatting is an absolute nightmare that I try my best to avoid, so I’m probably not the right person to answer this. However, this seems like a reasonable response, so maybe it can be used as a building block for conversation?


Here’s a general approach you can consider:

  1. Extracting start and end dates: You can use string manipulation functions in MySQL to extract the start and end dates from your ISO8601 periods.

  2. Filtering by start and end dates: Once you have extracted the start and end dates, you can use MySQL’s date functions to filter rows based on your criteria.

Here’s an example of how you could achieve this:

SELECT *
FROM your_table
WHERE
    -- Extract start and end dates
    CASE
        -- Case when period has both start and end dates
        WHEN period_column LIKE '%/%' THEN
            SUBSTRING_INDEX(period_column, '/', 1) >= 'your_start_date' AND
            SUBSTRING_INDEX(period_column, '/', -1) <= 'your_end_date'
        -- Case when period has only start date
        ELSE
            SUBSTRING_INDEX(period_column, '/', 1) >= 'your_start_date'
    END;

In this query:

  • SUBSTRING_INDEX() function is used to extract the start and end dates from the period column.
  • We’re using a CASE statement to handle periods with both start and end dates, as well as periods with only start dates.
  • Replace 'your_start_date' and 'your_end_date' with the dates you want to filter by.

It’s important to ensure that the format of your ISO8601 periods is consistent throughout the column.


What’s wrong with using PHP’s time() or MySQL’s CURRENT_TIME()?

3 Likes

Hi Greenreader9,

Thank you for your reply.

I tried this at first, but splitting “2007-11-13T09:00/15T17:00” by “/” is going to result in something like:
Start date: “2007-11-13T09:00”
End date: “15T17:00”
The parsing for this case will work if the MySQL command is crafted enough for compensating this specific case with the missing YYYY-MM at the expense of very slow performance.

The command quickly became minecraft redstone computer when period syntaxes are added to the mix.

It is safe to say the column of ISO8601s are all valid, but not necessarily consistent as ISO8601 explicitly allows the following schemas:

<start>/<end>
<start>/<period>
<period>/<end>

(this one was not included fortunately)

The goal here is to query the database of events that overlaps with a queried time frame, which is not the duty of PHP. Performing the query using PHP functions or libraries would likely require a full table retrieval which takes just as long as re-evaluating the whole thing.

The Carbon library is very attractive in this case but unfortunately that’s something I intend to use for the re-evaluation part but not as a interim solution before that completes.

I have also considered something like this but for the same reason of table scanning this could take even longer than re-evaluation I’m afraid.

MySQL function:

DELIMITER //

CREATE FUNCTION ParseISO8601Period(period_str VARCHAR(255))
RETURNS DATETIME
BEGIN
    DECLARE start_date VARCHAR(255);
    DECLARE duration VARCHAR(255);
    DECLARE end_date DATETIME;
    
    -- Extract start date and duration (if present)
    SET start_date = SUBSTRING_INDEX(period_str, '/', 1);
    SET duration = SUBSTRING_INDEX(period_str, '/', -1);
    
    -- Parse start date
    SET end_date = STR_TO_DATE(start_date, '%Y-%m-%dT%H:%i:%s');
    
    -- Handle duration (if provided)
    IF duration LIKE 'P%' THEN
        -- Duration format (e.g., P1Y2M10DT2H30M)
        SET duration = REPLACE(duration, 'P', '');
        SET duration = REPLACE(duration, 'T', ' ');
        SET duration = REPLACE(duration, 'Y', '-');
        SET duration = REPLACE(duration, 'M', '-');
        SET duration = REPLACE(duration, 'D', '');
        SET duration = REPLACE(duration, 'H', ':');
        SET duration = REPLACE(duration, 'S', '');
        SET end_date = DATE_ADD(end_date, INTERVAL duration SECOND);
    END IF;
    
    RETURN end_date;
END //

DELIMITER ;

Usage:

SELECT ParseISO8601Period('2007-03-01T13:00:00Z/P1Y2M10DT2H30M'); -- Example with duration
SELECT ParseISO8601Period('2007-11-13T09:00/15T17:00'); -- Example with omitted parts```

That’s not what I meant. I meant, why not just store the value outputted by the PHP function time() (Or MySQL’s CURRENT_DATE()

3 Likes

I see, that’s because the data came in that format, and before I could revamp it to something that’s easier to work with, I’m stuck with the column of ISO8601. :joy:

1 Like

I’m familiar with ISO 8601 for date and time notation, but I never knew that they had a format for intervals too.

Parsing the format entirely within MySQL seems painful, and hard to query efficiently.

So if I may suggest a different approach: could you transform the data to a format where it’s easier to do the calculations you need? The input dataset may not have it, but you could opt to either transform the data when loading the dataset, or even load it into one database table as-is, and then run a script to build a second table with the data in a format that’s easier to query.

Having an Extract-Tranform-Load (ETL) pipeline with multiple processing steps to query the data efficiently is quite common for data analysis.

7 Likes

Hi Admin,

The ETL pipeline exists, but the progress to process all data is taking quite a while, therefore an interim solution for parsing this column is needed to cater the existing demand first.

Ultimately, all data would be stored in the start end timestamps, with a column of period notations on the 3rd column.

Cheers!

1 Like

You do you, but I cannot think of any way you could parse the source data on the fly in a way that’s not ugly and slow.

Regardless of whether you use MySQL or PHP for this, it will have to do processing for every row with every query to do the necessary comparison. That just doesn’t scale. You want to be able to do indexed lookups, and to do that you will likely need to preprocess your data.

Seeing how you say that processing the data takes a long time, I’m guessing the dataset is also quite big? If so, you NEED to optimize the data for efficient querying, as whatever suboptimal parsing solution you come up with will get much slower if the dataset is large.

So my recommended “interim solution” is to tell people that they will just need to wait, because making it possible to query a large dataset takes time.

6 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.