Sean's Blog

Where the heart leads, the body follows.

The difference between a data engineer and a data scientist

Data Engineer vs Data Scientist: A Comparative Overview

Primary Focus
  • Data Engineer: Building scalable data architectures and maintaining data pipelines.
  • Data Scientist: Extracting insights, identifying patterns, and building predictive models.
Responsibilities (Data Engineer)
  • Designing systems for large-scale data handling.
  • Streamlining data acquisition.
  • Ensuring data quality and integrity.
Responsibilities (Data Scientist)
  • Mining data for patterns and trends.
  • Applying statistical models.
  • Building machine learning-based predictive models.
Tools and Technologies (Data Engineer)
  • Databases: SQL, NoSQL
  • Processing Frameworks: Apache Spark, Hive, Flink, Kafka
  • Scheduling: Apache Airflow, Oozie, Luigi
  • Cloud Platforms: AWS, Azure, GCP
  • Programming: Python, Java, Scala
Tools and Technologies (Data Scientist)
  • Programming: Python, R
  • Visualization: Tableau, Power BI, Matplotlib, Seaborn
  • ML Frameworks: TensorFlow, PyTorch
  • Big Data: Hadoop, Spark
  • Statistical Software: SAS, MATLAB
Skill Focus
  • Data Engineer: System design, data pipeline creation, and optimization.
  • Data Scientist: Data analysis, statistical modeling, and advanced machine learning.
Collaboration Role
  • Data Engineer: Provides the infrastructure and tools necessary for data scientists to perform their analyses effectively.
  • Data Scientist: Leverages the infrastructure to derive actionable insights and guide business decisions.

Conclusion

Data engineers and data scientists serve distinct but complementary roles in any data-driven organization. Engineers handle the foundational infrastructure, enabling scientists to focus on deriving valuable insights. Together, they drive the success of data initiatives.


How to invoke a method in a jar?

We can use URLClassLoader to load classes from a given path.

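      // Requires: java.io.File, java.lang.reflect.Method, java.net.URL, java.net.URLClassLoader
      URL myJar = new File("jar/LibraryA-1.0-SNAPSHOT.jar").toURI().toURL();
      // Delegate to the current class loader first, then search the jar
      URLClassLoader clsLoader = new URLClassLoader(
              new URL[] {myJar},
              this.getClass().getClassLoader()
      );
      Class<?> loadedClass = clsLoader.loadClass("com.sean.liba.Main");
      Method method = loadedClass.getDeclaredMethod("print");
      // Class.newInstance() is deprecated since Java 9; call the no-arg constructor instead
      Object instance = loadedClass.getDeclaredConstructor().newInstance();
      method.invoke(instance);

      // Output: Hello World!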

Let’s look at another use case. What if you have two jars, and a class in liba.jar depends on a class in libb.jar?

In the example above, the print method depends on the com.sean.libb.Caculator class. If we don’t change the code and run it again, we get an error immediately.
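
One common way to handle a dependency between jars is to put both of them on the same URLClassLoader, so that classes from the first jar can resolve classes from the second. The sketch below is only an illustration of that idea and not necessarily the fix the original post goes on to use. The class name MultiJarLoader, the helper loaderFor, and the second jar path are assumptions made up for this example.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;

public class MultiJarLoader {

    // Builds a single class loader that searches every jar in jarPaths,
    // delegating to the given parent loader first.
    static URLClassLoader loaderFor(ClassLoader parent, String... jarPaths)
            throws MalformedURLException {
        URL[] urls = new URL[jarPaths.length];
        for (int i = 0; i < jarPaths.length; i++) {
            urls[i] = new File(jarPaths[i]).toURI().toURL();
        }
        return new URLClassLoader(urls, parent);
    }

    // Loads com.sean.liba.Main with both jars visible and calls print().
    // "jar/LibraryB-1.0-SNAPSHOT.jar" is a hypothetical path for libb.
    static void invokePrint() throws Exception {
        try (URLClassLoader loader = loaderFor(
                MultiJarLoader.class.getClassLoader(),
                "jar/LibraryA-1.0-SNAPSHOT.jar",
                "jar/LibraryB-1.0-SNAPSHOT.jar")) {
            Class<?> loadedClass = loader.loadClass("com.sean.liba.Main");
            Object instance = loadedClass.getDeclaredConstructor().newInstance();
            loadedClass.getDeclaredMethod("print").invoke(instance);
        }
    }
}
```

Because both jars belong to the same loader, any class that com.sean.liba.Main refers to is looked up in both jars.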


[Elasticsearch] Working with disjunction max query - dis_max

GET /_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "term": { "title": "iphone" } },
        { "term": { "body": "iphone" } }
      ]
      // optional, after a comma: "tie_breaker": 0.7 (default 0.0)
    }
  }
}

The official documentation describes dis_max as follows:

Returns documents matching one or more wrapped queries, called query clauses or clauses.

If a returned document matches multiple query clauses, the dis_max query assigns the document the highest relevance score from any matching clause, plus a tie breaking increment for any additional matching subqueries.
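
The scoring rule quoted above can be sketched in a few lines of Java. This only illustrates the arithmetic and is not how Elasticsearch implements it. The class and method names (DisMaxScoreSketch, score) are made up, and the clause scores are hypothetical numbers.

```java
public class DisMaxScoreSketch {

    // clauseScores: relevance scores of the clauses that matched the document.
    // Result: the best score, plus tieBreaker times each of the other scores.
    static double score(double[] clauseScores, double tieBreaker) {
        double best = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            best = Math.max(best, s);
            sum += s;
        }
        return best + tieBreaker * (sum - best);
    }
}
```

With hypothetical clause scores of 1.2 (title) and 0.8 (body), a tie_breaker of 0.0 gives 1.2, the best score alone, while a tie_breaker of 0.7 gives 1.2 + 0.7 × 0.8 = 1.76.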


[Elasticsearch] How to use minimum_should_match and operator with match query?

GET /_search
{
  "query": {
    "match": {
      "message": {
        "query": "this is a test yo"
        // optional, after a comma: "operator": "or" (the default)
      }
    }
  }
}

This is the match query we see quite often when using ES. If you did not specify an analyzer in the mapping, the query “this is a test yo” will likely be tokenized into five terms, “this”, “is”, “a”, “test”, and “yo”, at search time. There is also an implicit operator parameter whose default value is “or”. This means the query looks through the documents in the index, and a document matches as soon as any one of those terms appears in its message field.
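
To make the difference between the operators concrete, here is a toy Java model of the matching rule. It is a rough sketch and not Elasticsearch code: analysis is reduced to lowercasing and splitting on whitespace, scoring is ignored, and every name in it (MatchOperatorSketch, tokens, matchesOr, matchesAnd, matchesAtLeast) is made up for illustration. The last method models the minimum_should_match parameter from the title as a plain integer count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MatchOperatorSketch {

    // Crude stand-in for an analyzer: lowercase, then split on whitespace.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Number of distinct query terms that also occur in the field text.
    static long matchedTerms(String query, String field) {
        Set<String> fieldTerms = tokens(field);
        return tokens(query).stream().filter(fieldTerms::contains).count();
    }

    // operator "or": one matching term is enough.
    static boolean matchesOr(String query, String field) {
        return matchedTerms(query, field) >= 1;
    }

    // operator "and": every query term must occur in the field.
    static boolean matchesAnd(String query, String field) {
        return matchedTerms(query, field) == tokens(query).size();
    }

    // minimum_should_match as an integer: at least that many terms must occur.
    static boolean matchesAtLeast(String query, String field, int minimumShouldMatch) {
        return matchedTerms(query, field) >= minimumShouldMatch;
    }
}
```

In this model, the query “this is a test yo” matches a message such as “just a quick note” under “or”, because the term “a” occurs in it, but not under “and”.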


How To Get A List Of Sheet Names In Google Sheets With Script?

There’s no built-in formula for this, so we have to write our own function with Google Apps Script.

  1. First, go to Extensions → Apps Script.

  2. Second, write our own function getSheetsName in the Apps Script editor, and save.

Code snippet:

function getSheetsName() {
  // Each name is wrapped in its own array so the result fills one column, one sheet name per row
  var sheetNames = [];
  var sheets = SpreadsheetApp.getActiveSpreadsheet().getSheets();
  for (var i = 0; i < sheets.length; i++) {
    sheetNames.push([sheets[i].getName()]);
  }
  return sheetNames;
}

Then go back to your sheet and enter a formula that calls the function we just created in Apps Script, for example =getSheetsName().