Data mining Journal Files
credits: Sequential Pattern Mining Approach to Extract Information from BIM Design Log Files. Saman Yarmohammadi, Reza Pourabolghasem, Arezoo Shirazi, Baabak Ashuri
Design log data in general, and Autodesk Revit© journal files in particular, are unstructured text files that BIM tools create when modelers use the application. These files capture all modeling activities that occur during a design session, as well as system information such as memory performance and operating system.
We examine the feasibility of utilizing these log files as a non-intrusive data capturing mechanism for documenting modeller-software interactions and recording model development events.
Hypothesis: there are common command execution patterns among BIM modelers who work on similar projects.
In the context of this research, a journal log file is regarded as a database of ordered modeling events (commands) recorded with a concrete notion of time.
The following questions are of particular interest in this research:
- What types of command sequences do modelers execute frequently?
- What structures are formed from various commands at each stage of modeling and how?
- What command pattern sequences are common among different modelers?
- Is there a resemblance between the modeling behaviour of BIM users who work on similar types of projects?
Hu et al. developed a multivariate linear regression model to predict man-hour quantity for steel fabrication projects in the planning phase.
Abdelmohsen et al. developed a cost analysis and reporting system that utilizes elements’ dimensions extracted from BIM models.
The main emphasis of these research efforts was on utilizing data stored in BIM models to enhance non-design practices, such as production planning, cost estimation, and site layout analysis.
These studies are limited to utilizing information stored as physical and functional characteristics of elements after the model is already created.
Textual data mining techniques in AEC literature can effectively extract patterns from information stored in structured relational datasets or unstructured documents.
However, these approaches are restricted to data that lack time dimension.
Soibelman and Kim outlined the steps necessary to apply data mining and Knowledge Discovery in Databases (KDD) as tools to extract novel patterns in design and construction fields.
The novel method created in this research contributes to the body of knowledge by incorporating chronological dependencies of textual records into the existing pattern matching models.
2. Research Approach
This process consists of three major parts:
- a large number of Revit journal files that belong to a design project are collected. These text files are parsed to extract and store necessary data items.
- we transform the obtained data to construct long strings of characters and generate input vectors.
- we construct a Generalized Suffix Tree (GST) data structure for each user. A Depth-First Search (DFS) is later conducted on the GST structures to find common command sequences for each user and among the five users.
- 10 healthcare projects designed in 2013
- 5000 Revit journal files
- 10 GB of structured data
- 5 modelers -> 100’000 log entries treated as long strings
- The commands that each user issued were modelled as characters.
- The final string for each user was constructed by putting these characters in the original order they were recorded.
- Input vectors of modelers were then generated by joining the constructed command strings in chronological order.
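The character encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `encode_commands`, the use of the Unicode private use area for symbols, and the sample sessions are all assumptions made for the example.

```python
from typing import Dict, List


def encode_commands(commands: List[str], alphabet: Dict[str, str]) -> str:
    """Map each command name to one character, preserving recorded order.

    `alphabet` is updated in place so the same command always gets the
    same character, across sessions and across modelers.
    """
    chars = []
    for name in commands:
        if name not in alphabet:
            # Assumption: take symbols from the Unicode private use area
            # (U+E000 onward) so that hundreds of commands fit easily.
            alphabet[name] = chr(0xE000 + len(alphabet))
        chars.append(alphabet[name])
    return "".join(chars)


# Hypothetical sessions of one modeler, already in chronological order.
sessions = [
    ["Create a wall", "Move selected objects", "Align references"],
    ["Move selected objects", "Align references", "Create a line"],
]

alphabet: Dict[str, str] = {}
# Joining the per-session strings in order yields the modeler's input string.
user_string = "".join(encode_commands(s, alphabet) for s in sessions)

# The reverse mapping turns extracted patterns back into command names.
decode = {ch: name for name, ch in alphabet.items()}
```

Sharing one `alphabet` across all modelers is what makes patterns comparable between users later on; with a separate alphabet per user, the same character would stand for different commands.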
Commands Sequence Extraction
Agrawal and Srikant introduced the Generalized Sequential Pattern (GSP) algorithm to mine shopping patterns (1995).
- the time and memory performance of GSP is relatively low since a huge number of candidates must be generated and stored in each repetition for evaluation.
- PrefixSpan algorithm overcomes this issue by taking advantage of pattern-growth methodology following a divide-and-conquer approach called prefix-based projection.
- general Sequential Pattern Mining (SPM) methodologies do not preserve the exact order of elements in a sequence
- For the purpose of this study, a special GST-based string processing algorithm was used. This method is particularly useful as it maintains the order in which the executed commands are recorded.
- Guerbas et al. introduced an improved version of the GST algorithm with an optimized data structure to extract page visit patterns of internet users.
- The primary objective of the authors was to improve the search experience of users by predicting what pages they intend to view next. -> Which command would you use next?
- parse journal files and extract necessary data items using regular expression matching operations:
- modeller name
- date of the modeling session
- project name
- command execution time
- general and specific command description
- view type
- Revit version
- We manually search the files to identify the local format in which each data instance (e.g., project name, command, and view type) is recorded.
- Once patterns are identified, we use a text processor to extract and store information in a Comma-Separated Values (CSV) file.
- assign unique characters to each specific command name to generate the necessary input strings.
- generate long strings by joining these characters in accordance with the original order in which corresponding commands are recorded in each modeling session.
- Once these strings are constructed, use suffix trees to represent all suffixes of the string set.
- This technique enables us not only to identify patterns, but also calculate their frequency.
- These substrings, then, are filtered based on simple heuristics (e.g., minimum length of the substring, minimum frequency, etc.).
- At the end of this step, we have a limited number of the most frequent command substrings that are common to all the users.
- We call these substrings primitives.
- We repeat this process for all users to extract their frequent command patterns.
- the retrieved patterns are compared against each other to identify the ones that are common among different users.
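The extraction and comparison steps above can be illustrated with a small sketch. Note the substitution: instead of a Generalized Suffix Tree with DFS, this example counts substrings by brute force, which yields the same frequencies on small inputs but would be far too slow for strings with 100’000 entries. The function names, the `max_len` cap, and the sample strings are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Set


def frequent_substrings(text: str, min_len: int, min_freq: int,
                        max_len: int = 10) -> Dict[str, int]:
    """Count every substring of length min_len..max_len (overlaps allowed)
    and keep those occurring at least min_freq times."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_freq}


def common_patterns(per_user: Dict[str, Dict[str, int]]) -> Set[str]:
    """Patterns that are frequent for every user."""
    pattern_sets = [set(p) for p in per_user.values()]
    return set.intersection(*pattern_sets) if pattern_sets else set()


# Hypothetical encoded command strings (one character per command).
strings = {"modeler_1": "ABCABCABD", "modeler_2": "XABCYABCZ"}

per_user = {
    user: frequent_substrings(s, min_len=3, min_freq=2)
    for user, s in strings.items()
}
shared = common_patterns(per_user)  # candidate "primitives"
```

In this toy input, `modeler_1` has several frequent patterns, but only `"ABC"` is also frequent for `modeler_2`, so it is the only shared one. A suffix tree reaches the same answer without enumerating every substring, because each internal node already stores how many suffixes pass through it.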
Prior to using GST, we conducted some preliminary analysis to identify the most frequently executed individual commands:
| Modeler | Move selected objects | Align references | Create a line |
4. Experiments and Results
We also filtered out all noisy data (e.g., “cancel the current operation”, “delete”) and entries related to errors. The processed data was stored in .csv format amounting to over 5.5 million records.
We constructed a GST data structure for each modeller’s command string to identify shared sequences:
- Several arbitrary minimum frequency threshold values were tested, among which we selected 250 and 500.
- The minimum length of extracted common command patterns was also set to three.
- The primitives extracted for minimum threshold of 250 are longer and represent meaningful modeling activities
- The obtained patterns became shorter when we increased the minimum frequency threshold. This observation was expected since longer sequences tend to match less frequently
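The shortening of patterns at higher thresholds follows from a monotonicity property of substring counts. The notation below is introduced here for illustration and is not taken from the paper: $\mathrm{occ}(p, S)$ is the number of (possibly overlapping) occurrences of pattern $p$ in command string $S$, and $F_\tau$ is the set of patterns that pass minimum frequency $\tau$ and minimum length three.

$$
\mathrm{occ}(pc, S) \le \mathrm{occ}(p, S) \quad \text{for every pattern } p \text{ and command character } c,
$$

since every occurrence of $pc$ begins with an occurrence of $p$. Consequently,

$$
F_\tau = \{\, p : |p| \ge 3,\ \mathrm{occ}(p, S) \ge \tau \,\}, \qquad \tau_1 \le \tau_2 \;\Rightarrow\; F_{\tau_2} \subseteq F_{\tau_1},
$$

so raising the threshold from 250 to 500 can only remove patterns, and the longest surviving pattern can only stay the same length or get shorter.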
5. Conclusion and Future Work
- The first phase is to identify the format in which different information items are stored.
- Clean the obtained data
- use GST data structures to find common command sequences among BIM users.
- First, we transform command sequences into character-based input vectors.
- the transformed data is utilized to construct GST
- Frequent command patterns are identified by conducting DFS on the trees
- Extracted patterns for different users are compared against each other to identify shared sequences.
The obtained results also confirmed our initial hypothesis that there are frequent command execution sequences shared among BIM modelers who work on similar projects
- it might be a good idea to modify the proposed approach to calculate the average time it takes BIM users to execute different command patterns.
- provide customized training to improve designers’ performance.
- calculated average times can help design managers improve project outcomes by choosing an optimal team configuration.
Jrn.MouseMove 0 , 713 , 566
- 0: mouse button state (1 = mouse button is down)
- 713: x coordinate measured from the origin (upper left corner of the display window)
- 566: y coordinate
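A line of this shape can be picked apart with a regular expression. The sketch below is based only on the single example line above; the pattern, the `MouseMove` type, and the function name are assumptions, and real journal files may contain variants that this does not cover.

```python
import re
from typing import NamedTuple, Optional


class MouseMove(NamedTuple):
    button_down: bool  # True when the state field is 1
    x: int             # pixels from the left edge of the display window
    y: int             # pixels from the top edge of the display window


# Assumed layout: "Jrn.MouseMove <state> , <x> , <y>" with flexible spacing.
MOUSE_MOVE = re.compile(
    r"^\s*Jrn\.MouseMove\s+(\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*$"
)


def parse_mouse_move(line: str) -> Optional[MouseMove]:
    """Return the parsed event, or None if the line is not a MouseMove entry."""
    match = MOUSE_MOVE.match(line)
    if match is None:
        return None
    state, x, y = (int(group) for group in match.groups())
    return MouseMove(button_down=(state == 1), x=x, y=y)


event = parse_mouse_move("Jrn.MouseMove 0 , 713 , 566")
```

Returning `None` for non-matching lines lets the same function be applied to every line of a journal file, keeping only the entries it recognizes.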