Welcome to Data Science Methodology 101, From Requirements to Collection: Data Collection!
After the initial data collection is performed, an assessment by the data scientist takes
place to determine whether or not they have what they need.
As is the case when shopping for ingredients to make a meal, some ingredients might be
out of season and more difficult to obtain or cost more than initially thought.
In this phase the data requirements are revised and decisions are made as to whether or not
the collection requires more or less data.
Once the data ingredients have been collected in the data collection stage, the data scientist
will have a good understanding of what they will be working with.
Techniques such as descriptive statistics and visualization can be applied to the data
set to assess its content and quality, and to gain initial insights about the data.
Gaps in the data will be identified, and plans to either fill them or make substitutions
will have to be made.
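As a rough illustration of this kind of first assessment, here is a minimal sketch that computes a few descriptive statistics and counts missing values for one field at a time. The records, the field names (patient_id, age, num_claims), and the summarize helper are all hypothetical and are not taken from the case study.

```python
import statistics

# Hypothetical patient records (illustration only); None marks a missing value.
records = [
    {"patient_id": 1, "age": 67, "num_claims": 4},
    {"patient_id": 2, "age": 72, "num_claims": None},
    {"patient_id": 3, "age": None, "num_claims": 2},
    {"patient_id": 4, "age": 58, "num_claims": 7},
]

def summarize(rows, field):
    """Return basic descriptive statistics and the missing-value count for one field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "count": len(values),                 # how many usable values we have
        "missing": len(rows) - len(values),   # gaps that need filling or substitution
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

age_summary = summarize(records, "age")
claims_summary = summarize(records, "num_claims")
print(age_summary)
print(claims_summary)
```

A non-zero "missing" count is the kind of gap that would prompt a decision to fill in the values, substitute another field, or go back and collect more data.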
In essence, the ingredients are now sitting on the cutting board.
Now let's look at some examples of the data collection stage within the data science methodology.
This stage is undertaken as a follow-up to the data requirements stage.
So now, let's look at the case study related to applying "Data Collection".
Collecting data requires that you know the source of, or know where to find, the data elements
that are needed.
In the context of our case study, these can include:
demographic, clinical and coverage information of patients,
provider information, claims records, as well as
pharmaceutical and other information related to all the diagnoses of the congestive heart
failure patients.
For this case study, certain drug information was also needed, but that data source was
not yet integrated with the rest of the data sources.
This leads to an important point: it is all right to defer decisions about unavailable data,
and to attempt to acquire it at a later stage.
For example, this can even be done after getting some intermediate results from the predictive
modeling.
If those results suggest that the drug information might be important in obtaining a good model,
then it would be worth investing the time to try to get it.
As it turned out, though, they were able to build a reasonably good model without this
drug information.
DBAs and programmers often work together to extract data from various sources, and then
merge it.
Merging allows redundant data to be removed and makes the combined data available for the
next stage of the methodology, which is data understanding.
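To make the extract-and-merge idea concrete, here is a minimal sketch that removes exact duplicate rows from one extract and then joins it to another on a shared key. The two sample extracts, the patient_id key, and the deduplicate and merge_on_patient helpers are assumptions made for illustration; in practice this work is often done in SQL or with a data-frame library instead.

```python
# Hypothetical extracts from two separate sources (illustration only).
demographics = [
    {"patient_id": 1, "age": 67},
    {"patient_id": 2, "age": 72},
]
claims = [
    {"patient_id": 1, "claim_id": "C100", "amount": 250.0},
    {"patient_id": 1, "claim_id": "C100", "amount": 250.0},  # redundant duplicate
    {"patient_id": 2, "claim_id": "C200", "amount": 90.0},
]

def deduplicate(rows):
    """Drop rows that are exact copies of an earlier row, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def merge_on_patient(demo_rows, claim_rows):
    """Attach demographic fields to each claim that has a matching patient_id."""
    demo_by_id = {d["patient_id"]: d for d in demo_rows}
    merged = []
    for claim in claim_rows:
        demo = demo_by_id.get(claim["patient_id"])
        if demo is not None:  # claims with no matching patient are skipped
            merged.append({**demo, **claim})
    return merged

merged = merge_on_patient(demographics, deduplicate(claims))
for row in merged:
    print(row)
```

Deduplicating before the merge keeps the redundant claim from being carried into the combined data set that the data understanding stage will work with.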
At this stage, if necessary, data scientists and analytics team members can discuss various ways
to better manage their data, including automating certain processes in the database, so that
data collection is easier and faster.
Thanks for watching!