mmcdara


MaplePrimes Activity


These are replies submitted by mmcdara

@Scot Gould  

In my experience, it is best to name the rows and columns as strings.
I would even say it's more natural, more logical, and more in keeping with what a DataFrame is meant to be.
As the columns and the rows of a DataFrame are supposed to represent classes, it would be odd for a class to have the same name as one of its elements.

Moreover, the restriction "names as strings" introduces some safety, and I agree with you, Scot.


@eggoodaire

https://www.maplesoft.com/support/help/maple/view.aspx?path=DataFrame/Guide

The R software possesses exactly the same structure.
From http://uc-r.github.io/dataframes one can read
A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows.

I'm sure people who are used to Excel give names to columns and rows that differ from the content of a cell; a sketch of what this looks like in Maple is given below.
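For illustration, a minimal Maple sketch (the data and labels are purely hypothetical) where rows and columns are named with strings, kept distinct from the numeric content of the cells:

M  := Matrix([[1.75, 68.2], [1.62, 55.0]]):                             # hypothetical measurements
df := DataFrame(M, columns = ["Height", "Weight"], rows = ["Alice", "Bob"]);
df["Alice", "Height"];                                                   # a cell accessed through its string labels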

@Carl Love 

Yes and No.
Yes: "//" has a special meaning in Linux when it appears at the beginning of a path.
No: "//" has no special meaning when it appears within a path, where it is just equivalent to "/".
Linux collapses every interior "//...//" into a single "/":

restart:
a := currentdir():
M := ExcelTools:-Import(cat(a, "/A_sample.xlsx"));      # single "/" : works
M := ExcelTools:-Import(cat(a, "///A_sample.xlsx"));    # repeated "/" : works just as well


b := StringTools:-SubstituteAll(a, "\/", "\/\/"):       # double every "/" in the path
M := ExcelTools:-Import(cat(b, "/////A_sample.xlsx"));  # still works

 

Sorry for not being able to verify this right now (I am writing from my phone), but I think you should begin by replacing all backslashes ( \ ) with forward slashes ( / ).
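A minimal sketch of that replacement (the path below is hypothetical; note that inside a Maple string a backslash must itself be escaped as "\\"):

p := "C:\\Users\\me\\data\\A_sample.xlsx":
q := StringTools:-SubstituteAll(p, "\\", "/");          # "C:/Users/me/data/A_sample.xlsx"
M := ExcelTools:-Import(q);                             # then import as usual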

@Preben Alsholm 

Hi,
Why not this ?

f := (a, f, t) -> x -> add(a[i]*cos(f[i]*x+t[i]), i=1..numelems(a)):
plot(f(A, F, Theta)(x), x=-3..10);

which has the advantage of being (IMO) more flexible:

Theta2:=<0.2,0.3,2.5>;
plot(f(A, F, Theta2)(x), x=-3..10);
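For completeness, a self-contained sketch (the vectors A, F and Theta below are hypothetical stand-ins for those defined in the original worksheet):

A     := <1.0, 0.5, 0.8>:    # amplitudes
F     := <1, 2, 3>:          # frequencies
Theta := <0.0, 1.0, 2.0>:    # phases
f := (a, f, t) -> x -> add(a[i]*cos(f[i]*x+t[i]), i=1..numelems(a)):
plot(f(A, F, Theta)(x), x=-3..10);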

 

@segfault 

If you read carefully what I wrote, I am talking about converting an mw file to an mpl file "with Maple", and it takes very little time.
Maybe I misunderstood your problem and this is not what you are looking for, in which case I apologize for interfering in this discussion and confusing the issue.
In any case, nothing justifies your rudeness.

@vs140580 

R

  • Function stepAIC from package MASS (stepwise regression)
    • Searching for the best model in the AIC sense (Akaike Information Criterion) from the lower model [1] (intercept only) to the upper model [1, X1, ..., X10] (complete polynomial of total degree 1 in 10 indeterminates)
      returns [1, X1, ..., X10].
      Thus stepAIC = Ordinary Least Squares regression (OLS, function lm in R).

    • Searching for the best model in the AIC sense from the lower model [1] to the upper model [1, X1, ..., X10, X1*X2, ..., X9*X10] (complete polynomial of total degree 2 in 10 indeterminates)
      returns as the best model
       Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + 
          X10 + X4:X8 + X2:X5 + X5:X10 + X1:X9 + X6:X8 + X6:X9 + X4:X9 + 
          X2:X8 + X2:X3 + X3:X5 + X1:X2
      An even more complex model, very far from the 2-regressor or 3-regressor model you expect.

  • Function cv.glmnet from package glmnet (optimal lambda parameter in LASSO regression)
    • The answer is a value of lambda around 10^(-6) (recall that lambda=0 corresponds to OLS).
      Thus LASSO will select as the set of regressors... [X1, ..., X10] itself.
      Maybe some normalization of the regressors could help select a more parsimonious model.

Maple

    My implementation of LASSO (which still has to be improved) returns results of the same order as those of cv.glmnet: the LASSO model has between 8 and 10 regressors (getting a single model would require very intensive computation using cross-validation [your Train & Test base randomly split in a large number of ways], which I have not developed).
    Note that cv.glmnet uses this strategy and returns a result almost instantly: without clever coding, a naive Maple procedure would take an unbearably long time to run.

@vs140580 

I'm developing a Maple code for LASSO regression.
I still have a few problems to fix.

Meanwhile I will use R (this will also validate my implementation of LASSO in Maple). I will send you the results in 5 to 6 hours, for I have something else to do right now.

@vs140580 

Hi,
I missed your comment because it probably overlapped with my answer

A few comments :

  • we need to apply stepwise regression or lasso regression ... 
    Neither of these methods exists in Maple.
    I wrote a piece of code over a decade ago that does ridge regression (another possibility); I'll dig around and see if I can get my hands on it (actually, I'm fed up with implementing algorithms in Maple and prefer using R directly).

  • If no such model comes with two or three regressor variables even after a certain number of runs
    I suppose you consider only polynomial models of degree 1 (at most) in all regressors?
    Is the model y=x1+x1^2+x1*x2 a two-regressor model for you?

    For information, the best 3-regressor model is the 4th one (ranked in increasing values of the Akaike criterion) and has regressors [X3, X6, X9].

  • Step 4
    There is no difficulty at all in ExcelTools:-Export-ing the data, nor in generating drawings (a sketch is given after this list).
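A minimal sketch of such an export (the matrix and the file name are hypothetical):

results := LinearAlgebra:-RandomMatrix(5, 3, generator = 0. .. 1.):   # hypothetical results to export
ExcelTools:-Export(results, "Best_models.xlsx");                      # writes the matrix to an Excel file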

Subsidiary question: do you use any data analysis software (beyond the Excel statistics package maybe)?
If not, would you be interested in the results that R would provide on your data?

IMPORTANT
I made a huge mistake: I didn't see that the response (Y) was in the first column and I imagined that it was in the rightmost one.
The correction is easy:

XY   := Data[2..-1]:
N, P := LinearAlgebra:-Dimensions(XY):
P    := P-1:
N, P:

XY := XY[.., [$2..P+1, 1]]:  # add this line to put Y in rightmost column

Train_fraction := 0.7:
....

Here is the corrected file (you will see that the Akaike criterion doesn't seem well suited here, which probably argues for the use of something more reliable).

Solution.mw

 @mz6687

I've just updated my answer with a procedure to find the solution in a more workable way.

@segfault 

I experienced the same issue when I decided to put the code I develop under version control.
The first step was to convert (as @Rouben Rostamian said) all the mw files (about 20, in 1D text) into mpl files by opening each of them and exporting it to an mpl file.

At roughly 1 minute per file, this means that you will have done this job (admittedly not very pleasant) in about 1 hour.
Writing a conversion script in some language will probably take longer.

@vs140580

As I told you before, I don't know how familiar you are with these questions.
Anyway, here is an illustration for a limited number (10) of regressors.

(Larger numbers require specific methods, such as "stepwise regression", to explore a family of potential models efficiently.)

Analyse this worksheet, try to understand what I did, and if you face any difficulty do not hesitate to contact me.

The best model I obtain results from a balance between the quality of prediction (assessed on the test base) and parsimony (the number of terms, or parameters, it contains).
As a rule of thumb, the quality of prediction varies inversely with parsimony (quite intuitive, isn't it?).

TrainTest_and_ModelSelection_2.mw

restart:

with(Statistics):

 

Skip this if you read XY from some data file

 

# Simulation of the data you could read with
# XY := ExcelTools:-Import(...)
#
# Part one : regressors (aka "inputs")

N := 200:
P := 10:
X := LinearAlgebra:-RandomMatrix(N, P, generator=-1.0 .. 1.0):

# Part two : response (aka "output", assumed to be scalar)
#
# step 1 choose a model X --> Y (I will "forget" it later)


regressors := [seq(V__||p, p=1..P)]:
TrueModel  := randpoly(regressors, degree=1, homogeneous, coeffs=proc() combinat:-randperm([0, 1, 10])[1] end proc)
              +
              randpoly(regressors, degree=2, homogeneous, coeffs=proc() combinat:-randperm([0$3, 0.1$2])[1] end proc)
              ;

TrueModel_coeff := coeffs(TrueModel);
TrueModel_base  := [op(TrueModel)] /~ TrueModel_coeff ;
TrueModel := unapply(TrueModel, regressors);



 

10*V__1 + V__2 + 10*V__4 + V__8 + 10*V__9 + 10*V__10 + 0.1*V__2*V__6 + 0.1*V__9^2 + 0.1*V__4*V__6

0.1, 0.1, 10, 1, 10, 1, 10, 10, 0.1

(the displays of TrueModel_base and of TrueModel as an arrow procedure follow)

(1)

# step 2 build Y = TrueModel(X)

Y := Vector(N, i -> TrueModel(entries(X[i], nolist))):

# Part three: XY = < X | Y>.
 
XY := <X|Y>

XY is a 200 x 11 Matrix

(2)

 

Real stuff begins here

 

 

Train_fraction := 0.4:
Train_N        := round(N*Train_fraction);
mixing         := combinat:-randperm([$1..N]):
Train_num      := mixing[1..Train_N]:
Test_num       := mixing[Train_N+1..N]:

Train_XY       := XY[Train_num];
Test_XY        := XY[Test_num];

Train_N := 80

Train_XY is an 80 x 11 Matrix and Test_XY a 120 x 11 Matrix

(3)

# Define a family of models the true (but now assumed unknown) model is
# supposed to belong to.
#
# For the sake of simplicity one can assume this family is made of order-1 polynomials
# in 1 or 2 ... or P=10 regressors.
# This family has 2^P = 1024 elements (2^P - 1 = 1023 once the empty model is excluded).
#
# Scanning all models is still possible, but if P is larger (you talked about values around 50)
# or if the family is more complex (polynomials of total degree 2 for instance), the
# combinatorics becomes huge and very specific methods Maple doesn't possess are required
# to make the model selection efficient.
#
# Family_code codes for the members: the nth member is a first-order polynomial whose
# indeterminates are those such that Family_code[n, i] = 1 for i = 1..P.


Family_code := convert(
                 map(
                   n -> parse~(
                          StringTools:-Explode(sprintf(cat("%.", P, "d"), convert(n, binary)))
                        )
                        , [$1..2^P-1]
                 )
                 , Matrix
               ):

# Constructor of bases of regressors

Member_base := n  -> [op(Family_code[n] . <regressors>)]:

# Two examples

Family_code[1]  , Member_base(1);
Family_code[123], Member_base(123);

Vector[row]([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]), [V__10]

Vector[row]([0, 0, 0, 1, 1, 1, 1, 0, 1, 1]), [V__4, V__5, V__6, V__7, V__9, V__10]

(4)

# column extractor

Extractor := n -> [ ListTools:-SearchAll(1, convert(Family_code[n], list)) ]:

# Two examples

Family_code[1]  , Extractor(1);
Family_code[123], Extractor(123);

Vector[row]([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]), [10]

Vector[row]([0, 0, 0, 1, 1, 1, 1, 0, 1, 1]), [4, 5, 6, 7, 9, 10]

(5)

# One example of regression
#
# step 1 : build the vector of coefficients

Member_res := LinearFit(
                Member_base(123)
                , Train_XY[.., Extractor(123)]
                , Train_XY[.., P+1]
                , Member_base(123)
                , output=[parametervector, residualstandarddeviation]
              ):
Member_coeff       := Member_res[1];
Member_Train_error := Member_res[2];

# step 2 : Use Member_coeff to predict over the test base

Member_pred := Test_XY[.., Extractor(123)] . Member_coeff;

# step 3 : compare the test base output to the test predictions

ScatterPlot(Test_XY[.., P+1], Member_pred)

Member_coeff := <11.3893346874433, -2.34552597739591, -0.344353343418147, -0.305263808356304, 11.7011863008406, 9.92783883220806>

Member_Train_error := 6.24311415233925

(the Member_pred vector and the ScatterPlot of the test-base responses against these predictions are displayed here)

 

 

 

Find the model within the family which minimizes... what?

Obviously the model which minimizes the residual sum of squares (RSS) between the test-base
responses and the predictions is the full model, with all the regressors accounted for.

This model has "complexity" P.
Generally (it would take too much time to explain why if you are not familiar with the topic),
one considers as the "best model" the one which realizes some balance between RSS and complexity.

I choose here the Akaike criterion (AIC):

AIC = 2*NI - 2*log(L)

where:
      NI = number of indeterminates of the model,
      L  = maximum likelihood

 
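For reference, under a Gaussian error assumption the term -2*ln(L) reduces, up to an additive constant, to N*ln(RSS/N), so a textbook form of the AIC can be sketched as follows (the numerical values are hypothetical; the loop below uses the ratio of test to train residual variances as a simpler proxy):

AIC_gaussian := (NI, RSS, N) -> 2*NI + N*ln(RSS/N):   # NI = complexity, RSS over a base of size N
evalf(AIC_gaussian(4, 36.5, 120));                    # hypothetical complexity, RSS and base size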

Family_res := Matrix(2^P-1, 3):

for n from 1 to 2^P-1 do
  Member_coeff := LinearFit(
                    Member_base(n)
                    , Train_XY[.., Extractor(n)]
                    , Train_XY[.., P+1]
                    , Member_base(n)
                    , output=parametervector
                  ):

  Member_RSS := Variance(Test_XY[.., P+1] - Test_XY[.., Extractor(n)] . Member_coeff);

  Family_res[n, 1] := numelems(Member_base(n));
  Family_res[n, 2] := Member_RSS;
  Family_res[n, 3] := Variance(Train_XY[.., P+1] - Train_XY[.., Extractor(n)] . Member_coeff);

  if n mod 10 = 0 then printf("%d  %a\n", n, Member_RSS) end if;
end do:

   

# Assuming a gaussian likelihood L, the term ln(L) simplifies to Family_res[.., 2]

Family_AIC := 0*Family_res[.., 1] + Family_res[.., 2]/~Family_res[.., 3]:

ScatterPlot(
  Family_res[.., 1]
  , Family_res[.., 2]/~Family_res[.., 3]
  , symbol=box
  , symbolsize=10
  , labels=["Complexity", "RSS"]
);

print():print():

ScatterPlot(
  Vector(2^P-1, i -> i)
  , Family_AIC
  , symbol=solidcircle
  , symbolsize=5
  , size=[1200, 300]
  , labels=["Model num", "AIC"]
)

 

 

 

 

Family_5_best_num   := sort(Family_AIC, output=permutation)[1..5];
Family_5_best_model := Member_base~(Family_5_best_num)

[23, 279, 19, 275, 287]

 

[[V__6, V__8, V__9, V__10], [V__2, V__6, V__8, V__9, V__10], [V__6, V__9, V__10], [V__2, V__6, V__9, V__10], [V__2, V__6, V__7, V__8, V__9, V__10]]

(6)

# Observe that Family_5_best_model contains the 4 most important terms of TrueModel


TrueModel(op(regressors))

10*V__1+V__2+10*V__4+V__8+10*V__9+10*V__10+.1*V__2*V__6+.1*V__9^2+.1*V__4*V__6

(7)

# The "best model"

q   := Family_5_best_num[1];
MOD := Family_5_best_model[1];

Member_coeff := LinearFit(
                  Member_base(q)
                  , Train_XY[.., Extractor(q)]
                  , Train_XY[.., P+1]
                  , Member_base(q)
                  , output=parametervector
                ):
`Best Model` = add(MOD[k]*%[k], k=1..numelems(MOD));


Member_pred := Test_XY[.., Extractor(q)] . Member_coeff:

`Residual variance` = Variance(Test_XY[.., P+1] - Test_XY[.., Extractor(q)] . Member_coeff);
`Residual std dev` = StandardDeviation(Test_XY[.., P+1] - Test_XY[.., Extractor(q)] . Member_coeff);
`Akaike criterion` = Family_AIC[q];

plots:-display(
  ScatterPlot(
    Test_XY[.., P+1]
    , Member_pred
    , labels=["Y (test base)", "Y (best prediction)"]
    , labeldirections=[default, vertical]
  )
  , plot([[min(Test_XY[.., P+1])$2], [max(Test_XY[.., P+1])$2]], color=red)
)

q   := 23
MOD := [V__6, V__8, V__9, V__10]

Best Model = 0.8354093551853539*V__6 + 0.7583311293323962*V__8 + 10.804152162622103*V__9 + 9.699987606490172*V__10

Residual variance = 45.768157987824665
Residual std dev  = 6.765216773158468
Akaike criterion  = 0.546026872478428

(the ScatterPlot of the test-base responses against the best-model predictions, with the reference line, is displayed here)

 

 

# Let's compare the previous results to the performance of the
# most complex model (number 2^P-1).
#
# As said previously this model has the lowest RSS, but given its complexity it
# appears not to be better than the one which minimizes the Akaike criterion.

q := 2^P-1:


Member_coeff := LinearFit(
                  Member_base(q)
                  , Train_XY[.., Extractor(q)]
                  , Train_XY[.., P+1]
                  , Member_base(q)
                  , output=parametervector
                ):
`Best Model` = add(Member_base(q)[k]*%[k], k=1..numelems(Member_base(q)));


Member_pred := Test_XY[.., Extractor(q)] . Member_coeff:

`Residual variance` = Variance(Test_XY[.., P+1] - Test_XY[.., Extractor(q)] . Member_coeff);
`Residual std dev` = StandardDeviation(Test_XY[.., P+1] - Test_XY[.., Extractor(q)] . Member_coeff);
`Akaike criterion` = Family_AIC[q];

plots:-display(
  ScatterPlot(
    Test_XY[.., P+1]
    , Member_pred
    , labels=["Y (test base)", "Y (best prediction)"]
    , labeldirections=[default, vertical]
  )
  , plot([[min(Test_XY[.., P+1])$2], [max(Test_XY[.., P+1])$2]], color=red)
)

Best Model = 10.001550480666388*V__1 + 0.9927300601057709*V__2 - 0.014916247655736804*V__3 + 10.008052487377197*V__4 - 0.009053049364341394*V__5 - 0.000727626675518388*V__6 + 0.014239766760354938*V__7 + 0.9967655532938859*V__8 + 9.993577323334744*V__9 + 9.982573892912415*V__10

Residual variance = 0.0030934596106570573
Residual std dev  = 0.05561887818589168
Akaike criterion  = 1.167024089819541

(the corresponding ScatterPlot is displayed here)

 

 

 

Download TrainTest_and_ModelSelection_2.mw

@vs140580 

So it is not a Train and Test base that you need, but something else, generically named "model competition".

If I understand correctly, you have a set X of scalar regressors, a scalar response Y, and a family F = {F[i] : X -> Y, i in some countable or uncountable set} of models.
You want to find the best (in what sense?) model F* in F, such that || Ytest - F*(Xtest) || is minimum for a suitable norm ||.||.
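To fix ideas, here is a minimal sketch of such a scoring for one candidate model (all the data and the candidate F* are hypothetical):

Xtest := LinearAlgebra:-RandomMatrix(20, 3, generator = -1. .. 1.):
Ytest := Xtest . <1, 2, 3>:                            # hypothetical "true" test responses
Fstar := X -> X . <1.1, 1.9, 3.05>:                    # one hypothetical candidate model
score := LinearAlgebra:-Norm(Ytest - Fstar(Xtest), 2); # 2-norm of the test residuals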

If I'm mistaken, explain yourself more clearly.
If I'm right, provide a set of data with explanatory material.

The problem as a whole is so vast that there are many strategies I could discuss for hours, so be specific.

An illustration is provided in my answer 
https://www.mapleprimes.com/questions/235524-Give-A-Data-For-Regression-In-Excel#answer291716

@sand15 is my professional login.
I'm answering now from home.

As I don't have the Excel data, the first part of the worksheet is aimed at emulating them.

Download Train_and_Test.mw


Great job!
I had thought about it for a while but finally gave up.

@pauldaas

In case you are interested in the plot alone, this procedure accounts for "undefined" points, that is values of beta for which eq(beta) is of the form 0=0:

restart:

# This procedure doesn't handle the case of complex roots

SOLplot := proc(f, R, {n::positive:=100})
  local r, e, s, pts, disc, plo:
  pts  := NULL:
  disc := NULL:
  plo  := NULL:
  # sample beta over the range R with n equally spaced points
  for r from op(1, R) to op(2, R) by (op(2, R)-op(1, R))/(n-1) do
    e := eval(f, beta=r):
    if has(e, xi) then
      # keep only the real roots xi lying in (0, 1)
      s   := select(is, [solve(e, xi)], RealRange(Open(0), Open(1))):
      pts := pts, seq([r, i], i in s):
    else
      # eq(beta) no longer contains xi (typically 0=0): record the "discontinuity"
      # and flush the points accumulated so far as a separate branch
      plo  := plo, plot([pts], style=line, view=[R, default]):
      pts  := NULL:
      disc := disc, r:
    end if:
  end do:
  plo  := plo, plot([pts], style=line, view=[R, default]):
  return [pts], [disc], plots:-display(plo)
end proc:

eq := [
  (1/6)*beta^3+(1/2)*xi^2*beta-(1/2)*beta^2-(1/2)*xi^2+(1/3)*beta = 0,
  (1/6)*beta*(beta^2+3*xi^2-6*xi+2) = 0
];

[(1/6)*beta^3+(1/2)*xi^2*beta-(1/2)*beta^2-(1/2)*xi^2+(1/3)*beta = 0, (1/6)*beta*(beta^2+3*xi^2-6*xi+2) = 0]

(1)

pts, disc, p := SOLplot(eq[1], 0..2, n=101):

# "discontinuity" points
disc

[1]

(2)

# "discontinuity" points
plots:-display(p, gridlines=true)

 

pts, disc, p := SOLplot(eq[2], -1..1, n=101):

# "discontinuity" points
disc

[0]

(3)

# "discontinuity" points
plots:-display(p, gridlines=true)

 

 

Download PlotAlone.mw
